Abstract

I study the informational complexity of active learning in a statistical learning theory framework. Specifically, I derive bounds on the rates of convergence achievable by active learning, under various noise models and under general conditions on the hypothesis class. I also study the theoretical advantages of active learning over passive learning, and develop procedures for transforming passive learning algorithms into active learning algorithms with asymptotically superior label complexity. Finally, I study generalizations of active learning to more general forms of interactive statistical learning.


Chapter  1 


Notation  and  Background 


1.1  Introduction 


In active learning, a learning algorithm is given access to a large pool of unlabeled examples, and is allowed to request the label of any particular example from that pool, interactively. The objective is to learn a function that accurately predicts the labels of new examples, while requesting as few labels as possible. This contrasts with passive learning, where the examples to be labeled are chosen randomly. In comparison, active learning can often significantly decrease the workload of human annotators by more carefully selecting which examples from the unlabeled pool should be labeled. This is of particular interest for learning tasks where unlabeled examples are available in abundance, but label information comes only through significant effort or cost.

In the passive learning literature, there are well-known bounds on the rate of convergence of the loss of an estimator, as a function of the number of labeled examples observed [e.g., Benedek and Itai, 1988, Blumer et al., 1989, Koltchinskii, 2006, Kulkarni, 1989, Long, 1995, Vapnik, 1998]. However, significantly less is presently known about the analogous rate in active learning: namely, the rate of convergence of the loss of an estimator, as a function of the number of label requests made by an active learning algorithm.

In this thesis, I will outline some recent progress I have been able to make toward understanding the achievable rates of convergence by active learning, along with algorithms that achieve them. I will also describe a few of the many open problems remaining on this topic.

The thesis begins with a brief survey of the history of this topic, along with an introduction to the formal definitions and notation that will be used throughout the thesis. It then describes some of my contributions to this area. To begin, Chapter 2 describes some rates of convergence achievable by active learning algorithms under various noise conditions, as quantified by a new complexity parameter called the disagreement coefficient. It then continues by exploring an interesting distinction between two different notions of label complexity: namely, verifiable and unverifiable. This distinction turns out to be extremely important for active learning, and Chapter 3 explains why. Following this, Chapter 4 describes a reductions-based approach to active learning, in which the goal is to transform passive learning algorithms into active learning algorithms having strictly superior label complexity. The results in that chapter are surprisingly general and of deep theoretical significance. The thesis concludes with Chapter 5, which describes some preliminary work on generalizations of active learning to more general types of interactive statistical learning, proving results at a higher level of abstraction, so that they can apply to a variety of interactive learning protocols.

1.2  A  Simple  Example:  Thresholds 

We begin with the canonical toy example illustrating the potential benefits of active learning. Suppose we are tasked with finding, somewhere in the interval $[0, 1]$, a threshold value $x$; we are scored based on how close our guess is to the true value, so that if we guess $x$ equals $z$ for some $z \in [0, 1]$, we are awarded $1 - |x - z|$ points. There is an oracle at hand who knows the value of $x$, and given any point $x' \in [0, 1]$ can tell us whether $x' > x$ or $x' < x$.

The passive learning strategy can be simply described as taking points uniformly at random from the interval $[0, 1]$ and asking the oracle whether each point is $> x$ or $< x$. After a number of these random queries, the passive learning strategy chooses its guess somewhere between $x'_1$ = the largest $x'$ that it knows is $< x$, and $x'_2$ = the smallest $x'$ it knows is $> x$ (say it guesses $\frac{x'_1 + x'_2}{2}$). By a simple argument, if the passive strategy asks about $n$ points, then the expected distance between $x'_1$ and $x'_2$ is at least $\frac{1}{n+1}$ (say for $x = 1/2$), so we expect the passive strategy's guess to be off by some amount $\geq \frac{1}{2(n+1)}$.

On the other hand, suppose instead of asking the oracle about every one of these random points, we instead look at each one sequentially, and only ask about a point if it is between the current $x'_1$ and the current $x'_2$; that is, we only ask about a point if it is not greater than a point $x'$ known to be $> x$ and not less than a point known to be $< x$. This certainly seems to be a reasonable modification to our strategy, since we already know how the oracle would respond for the points we choose not to ask about. In this case, if we ask the oracle about $n$ points, each one reduces the width of the interval $[x'_1, x'_2]$ at that moment by some factor $\beta_i$. These $n$ factors $\beta_i$ are upper bounded by $n$ independent Uniform$([1/2, 1])$ random variables (representing the fraction of the interval on the larger side of the $x'$), so that the expected final width of $[x'_1, x'_2]$ is at most $(\frac{3}{4})^n \leq \exp\{-n/4\}$. Therefore, we expect this modified strategy's guess to be off by at most half this amount.¹

As we will see, this modified strategy is a special case of an active learning algorithm I will refer to as CAL (after its discoverers, Cohn, Atlas, and Ladner [1994]) or Algorithm 0, which I introduce in Section 1.4. The gap between the passive strategy, which can only reduce the distance between the guess and the true threshold at a linear rate $\Omega(n^{-1})$, and the active strategy, which can reduce this distance at an exponential rate $\frac{1}{2}(\frac{3}{4})^n$, can be substantial. For instance, with $n = 20$, $\frac{1}{2(n+1)} \approx .024$ while $\frac{1}{2}(\frac{3}{4})^n \approx .0016$, better than an order of magnitude improvement. We will see several cases below where these types of exponential improvements are achievable by active learning algorithms for much more realistic learning problems, but in many cases the proofs can be thought of as simple generalizations of this toy example.
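The comparison between the two threshold strategies can be checked numerically. The following is a minimal simulation sketch, not part of the original analysis: the function names (`passive_width`, `active_width`, `mean_width`), the choice $x = 1/2$, the number of trials, and the seed are all assumptions made for illustration. The active strategy draws each queried point uniformly from the current interval, which has the same distribution as skipping uniform points on $[0, 1]$ until one falls inside the interval.

```python
import random

def passive_width(x, n, rng):
    """Label n uniform points; return the width of the interval still containing x."""
    lo, hi = 0.0, 1.0
    for _ in range(n):
        p = rng.random()
        if p < x:
            lo = max(lo, p)   # largest queried point known to lie below x
        else:
            hi = min(hi, p)   # smallest queried point known to lie at or above x
    return hi - lo

def active_width(x, n, rng):
    """Spend each of the n queries on a uniform point inside the current interval."""
    lo, hi = 0.0, 1.0
    for _ in range(n):
        # Same distribution as rejecting uniform points on [0, 1] that fall
        # outside (lo, hi), whose answers are already known.
        p = rng.uniform(lo, hi)
        if p < x:
            lo = p
        else:
            hi = p
    return hi - lo

def mean_width(strategy, x=0.5, n=20, trials=2000, seed=0):
    """Average final interval width over independent simulated runs."""
    rng = random.Random(seed)
    return sum(strategy(x, n, rng) for _ in range(trials)) / trials

if __name__ == "__main__":
    n = 20
    print("passive mean width:", mean_width(passive_width, n=n))
    print("active mean width: ", mean_width(active_width, n=n))
    print("1/(2(n+1))   =", 1 / (2 * (n + 1)))
    print("(1/2)(3/4)^n =", 0.5 * 0.75 ** n)
```

The guess error of either strategy is at most half the reported width, so the printed widths can be compared directly against twice the two closed-form quantities.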

¹Of course, the optimal strategy for this task always asks about $\frac{x'_1 + x'_2}{2}$, and thus closes the gap at a rate $2^{-n}$. However, the less aggressive strategy I described here illustrates a simple case of an algorithm we will use extensively below.

1.3 Notation


Perhaps the simplest active learning task is binary classification, and we will focus primarily on that task. Let $\mathcal{X}$ be an instance space, comprising all possible examples we may ever encounter. $\mathbb{C}$ is a set of measurable functions $h : \mathcal{X} \to \{-1, 1\}$, known as the concept space or hypothesis class. We also overload this notation so that for $m \in \mathbb{N}$ and a sequence $S = \{x_1, \ldots, x_m\} \in \mathcal{X}^m$, $h(S) = (h(x_1), h(x_2), \ldots, h(x_m))$. We denote by $d$ the VC dimension of $\mathbb{C}$, and by $\mathbb{C}[m] = \max_{S \in \mathcal{X}^m} |\{h(S) : h \in \mathbb{C}\}|$ the shatter coefficient (a.k.a. growth function) value at $m$ [Vapnik, 1998]. Generally, we will refer to any $\mathbb{C}$ with finite VC dimension as a VC class. $\mathbb{D}$ is a known set of probability distributions on $\mathcal{X} \times \{-1, 1\}$, in which there is some unknown target distribution $\mathcal{D}_{XY}$. I also denote by $\mathcal{D}[X]$ the marginal of $\mathcal{D}$ over $\mathcal{X}$. There is additionally a sequence of examples $(x_1, y_1), (x_2, y_2), \ldots$ sampled i.i.d. according to $\mathcal{D}_{XY}$. In the active learning setting, the $y_i$ values are hidden from the learning algorithm until requested. Define $Z_m = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$, a finite sequence consisting of the first $m$ examples.

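For any $h \in \mathbb{C}$ and distribution $\mathcal{D}$ over $\mathcal{X} \times \{-1, 1\}$, let $er_{\mathcal{D}}(h) = \mathbb{P}_{(X,Y) \sim \mathcal{D}}\{h(X) \neq Y\}$, and for $S = \{(x'_1, y'_1), (x'_2, y'_2), \ldots, (x'_m, y'_m)\} \in (\mathcal{X} \times \{-1, 1\})^m$, define the empirical error $er_S(h) = \frac{1}{2m} \sum_{i=1}^{m} |h(x'_i) - y'_i|$. When $\mathcal{D} = \mathcal{D}_{XY}$ (the target distribution), we abbreviate the former by $er(h) = er_{\mathcal{D}_{XY}}(h)$, and when $S = Z_m$, we abbreviate the latter by $er_m(h) = er_{Z_m}(h)$. The noise rate, denoted $\nu(\mathbb{C}, \mathcal{D}_{XY})$, is defined as $\nu(\mathbb{C}, \mathcal{D}) = \inf_{h \in \mathbb{C}} er_{\mathcal{D}}(h)$; we abbreviate this by $\nu$ when $\mathbb{C}$ and $\mathcal{D} = \mathcal{D}_{XY}$ are clear from the context (i.e., the concept space and target distribution). We also define $\eta(x; \mathcal{D}) = \mathbb{P}_{\mathcal{D}}(Y = 1 | x)$, and define the Bayes error rate, denoted $\beta(\mathcal{D})$, as $\beta(\mathcal{D}) = \mathbb{E}_{X \sim \mathcal{D}[X]}[\min\{\eta(X; \mathcal{D}), 1 - \eta(X; \mathcal{D})\}]$, which represents the best achievable error rate by any classifier; we will also refer to the Bayes optimal classifier, denoted $h^*_{\mathcal{D}}$, defined as $h^*_{\mathcal{D}}(x) = 2 \cdot \mathbb{1}[\eta(x; \mathcal{D}) \geq 1/2] - 1$; again, for $\mathcal{D} = \mathcal{D}_{XY}$, we may abbreviate this as $\eta(x) = \eta(x; \mathcal{D}_{XY})$, $\beta = \beta(\mathcal{D}_{XY})$, and $h^* = h^*_{\mathcal{D}_{XY}}$.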
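As a concrete reading of the empirical error $er_S(h)$ and of an empirical analogue of the noise rate, here is a small sketch. The threshold concepts, the function names, and the five-point sample are hypothetical and chosen only to make the arithmetic easy to follow; the minimum over a finite list stands in for the infimum over $\mathbb{C}$.

```python
def empirical_error(h, sample):
    """er_S(h) = (1 / (2m)) * sum_i |h(x_i) - y_i| for labels in {-1, +1}."""
    m = len(sample)
    # Each mistake contributes |h(x) - y| = 2, so dividing by 2m gives the
    # fraction of misclassified points.
    return sum(abs(h(x) - y) for x, y in sample) / (2 * m)

def make_threshold(t):
    """Hypothetical concept: predicts +1 when x >= t, and -1 otherwise."""
    return lambda x: 1 if x >= t else -1

def empirical_noise_rate(concepts, sample):
    """Smallest empirical error over a finite list of concepts."""
    return min(empirical_error(h, sample) for h in concepts)

if __name__ == "__main__":
    sample = [(0.1, -1), (0.3, -1), (0.5, 1), (0.7, -1), (0.9, 1)]
    concepts = [make_threshold(t) for t in (0.0, 0.4, 0.8)]
    for t, h in zip((0.0, 0.4, 0.8), concepts):
        print("threshold", t, "empirical error", empirical_error(h, sample))
    print("empirical noise rate:", empirical_noise_rate(concepts, sample))
```

No threshold is consistent with this sample, so the minimum empirical error is nonzero, mirroring a setting where $\nu > 0$.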

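For concept space $\mathbb{H}$ and distribution $\mathcal{D}$ over $\mathcal{X}$, for any measurable $h : \mathcal{X} \to \{-1, 1\}$ and any $r > 0$, define

$B_{\mathbb{H}, \mathcal{D}}(h, r) = \{h' \in \mathbb{H} : \mathbb{P}_{X \sim \mathcal{D}}(h(X) \neq h'(X)) \leq r\}$.

When $\mathbb{H} = \mathbb{C}$, $\mathcal{D} = \mathcal{D}_{XY}[X]$, or both are true, we may simply write $B_{\mathcal{D}}(h, r)$, $B_{\mathbb{H}}(h, r)$, or $B(h, r)$ respectively. For concept space $\mathbb{H}$ and distribution $\mathcal{D}$ over $\mathcal{X} \times \{-1, +1\}$, for any $\epsilon \in [0, 1]$, define the $\epsilon$-minimal set, $\mathbb{H}(\epsilon; \mathcal{D}) = \{h \in \mathbb{H} : er_{\mathcal{D}}(h) - \nu(\mathbb{H}, \mathcal{D}) \leq \epsilon\}$. When $\mathcal{D} = \mathcal{D}_{XY}$ (target distribution) and is clear from the context, we abbreviate this by $\mathbb{H}(\epsilon) = \mathbb{H}(\epsilon; \mathcal{D}_{XY})$. For a concept space $\mathbb{H}$ and distribution $\mathcal{D}$ over $\mathcal{X}$, define the diameter of $\mathbb{H}$ as $\mathrm{diam}(\mathbb{H}; \mathcal{D}) = \sup_{h_1, h_2 \in \mathbb{H}} \mathbb{P}_{X \sim \mathcal{D}}(h_1(X) \neq h_2(X))$; as before, when $\mathcal{D} = \mathcal{D}_{XY}[X]$ and is clear from the context, we will abbreviate this as $\mathrm{diam}(\mathbb{H}) = \mathrm{diam}(\mathbb{H}; \mathcal{D}_{XY}[X])$.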

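Also define the region of disagreement of a concept space $\mathbb{H}$ as

$\mathrm{DIS}(\mathbb{H}) = \{x \in \mathcal{X} : \exists h_1, h_2 \in \mathbb{H} \text{ s.t. } h_1(x) \neq h_2(x)\}$.

Also, for a concept space $\mathbb{H}$, distribution $\mathcal{D}$ over $\mathcal{X} \times \{-1, +1\}$, $\epsilon \in [0, 1]$, and $m \in \mathbb{N}$, define the expected continuity modulus as

$\omega_{\mathbb{H}}(m, \epsilon; \mathcal{D}) = \mathbb{E}_{S \sim \mathcal{D}^m} \sup_{h_1, h_2 \in \mathbb{H} :\, \mathbb{P}_{X \sim \mathcal{D}[X]}\{h_1(X) \neq h_2(X)\} \leq \epsilon} |(er_{\mathcal{D}}(h_1) - er_S(h_1)) - (er_{\mathcal{D}}(h_2) - er_S(h_2))|$.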

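At this point, let us distinguish between some particular settings, distinguished by the definition of $\mathbb{D}$ as one of the following sets of distributions.

• Agnostic $= \{\text{all } \mathcal{D}\}$ (the set of all joint distributions on $\mathcal{X} \times \{-1, +1\}$).

• BenignNoise$(\mathbb{C}) = \{\mathcal{D} : \nu(\mathbb{C}, \mathcal{D}) = \beta(\mathcal{D})\}$.

• Tsybakov$(\mathbb{C}, \kappa, \mu) = \{\mathcal{D} : \forall \epsilon > 0, \mathrm{diam}(\mathbb{C}(\epsilon; \mathcal{D}); \mathcal{D}) \leq \mu \epsilon^{1/\kappa}\}$, (for any finite parameters $\kappa \geq 1$, $\mu > 0$).

• Entropy$_{[]}(\mathbb{C}, \alpha, \rho) = \{\mathcal{D} : \forall m \in \mathbb{N} \text{ and } \epsilon \in [0, 1], \omega_{\mathbb{C}}(m, \epsilon; \mathcal{D}) \leq [\ldots]\}$, (for any finite parameters $\alpha > 0$, $\rho \in (0, 1)$).

• UniformNoise$(\mathbb{C}) = \{\mathcal{D} : \exists \alpha \in [0, 1/2), f \in \mathbb{C} \text{ s.t. } \forall x \in \mathcal{X}, \mathbb{P}_{\mathcal{D}}(Y \neq f(x) | X = x) = \alpha\}$.

• Realizable$(\mathbb{C}) = \{\mathcal{D} : \exists f \in \mathbb{C} \text{ s.t. } er_{\mathcal{D}}(f) = 0\}$.

• Realizable$(\mathbb{C}, \mathcal{D}_X) = $ Realizable$(\mathbb{C}) \cap \{\mathcal{D} : \mathcal{D}[X] = \mathcal{D}_X\}$, (for any given marginal distribution $\mathcal{D}_X$ over $\mathcal{X}$).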


Agnostic is the most general setting we will study, and is referred to as the agnostic case, where $\mathbb{D}$ is the set of all joint distributions. However, at times we will consider the other sets, which represent various restrictions of Agnostic. In particular, the set BenignNoise$(\mathbb{C})$ essentially corresponds to situations in which the lack of a perfect classifier in $\mathbb{C}$ is due to stochasticity of the labels, not model misspecification. Tsybakov$(\mathbb{C}, \kappa, \mu)$ is a further restriction, introduced by Mammen and Tsybakov [1999] and Tsybakov [2004], which (informally) represents those distributions having reasonably low noise near the optimal decision boundary (see Chapter 2 for further explanations). Entropy$_{[]}(\mathbb{C}, \alpha, \rho)$ represents the finite entropy with bracketing condition common to the empirical processes literature [e.g., Koltchinskii, 2006, van der Vaart and Wellner, 1996]. UniformNoise$(\mathbb{C})$ represents a (rather artificial) subset of BenignNoise$(\mathbb{C})$ in which every point has the same probability of being labeled opposite to the optimal label. Realizable$(\mathbb{C})$ represents the realizable case, popularized by the PAC model of passive learning [Valiant, 1984], in which there is a perfect classifier in the concept space; in this setting, we will refer to this perfect classifier as the target function, typically denoted $h^*$. Realizable$(\mathbb{C}, \mathcal{D}_X)$ represents a restriction of the realizable case, which we will refer to as the fixed-distribution realizable case; this corresponds to learning problems where the marginal distribution over $\mathcal{X}$ is known a priori.


Several of the more restrictive sets above may initially seem unrealistic. However, they become more plausible when we consider fairly complex concept spaces (e.g., nonparametric spaces). On the other hand, some (specifically, UniformNoise$(\mathbb{C})$ and Realizable$(\mathbb{C}, \mathcal{D}_X)$) are basically toy scenarios, which are only explored as stepping stones toward more realistic assumptions.


We now define the primary quantities of interest throughout this thesis: namely, rates of convergence, and label complexity.


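Definition 1.1. (Unverifiable rate) An algorithm $\mathcal{A}$ achieves a rate of convergence $\bar{R}(\cdot, \cdot)$ on expected excess error with respect to $\mathbb{C}$ if for any $\mathcal{D}_{XY}$ and $n \in \mathbb{N}$, if $h_n = \mathcal{A}(n)$ is the algorithm's output after at most $n$ label requests, for target distribution $\mathcal{D}_{XY}$, then

$\mathbb{E}[er(h_n)] - \nu(\mathbb{C}, \mathcal{D}_{XY}) \leq \bar{R}(n, \mathcal{D}_{XY})$.

An algorithm $\mathcal{A}$ achieves a rate of convergence $R(\cdot, \cdot, \cdot)$ on confidence-bounded excess error with respect to $\mathbb{C}$ if for any $\mathcal{D}_{XY}$, $\delta \in (0, 1)$, and $n \in \mathbb{N}$, if $h_n = \mathcal{A}(n)$ is the algorithm's output after at most $n$ label requests, for target distribution $\mathcal{D}_{XY}$, then

$\mathbb{P}(er(h_n) - \nu(\mathbb{C}, \mathcal{D}_{XY}) \leq R(n, \delta, \mathcal{D}_{XY})) \geq 1 - \delta$.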


Definition 1.2. (Verifiable rate) An algorithm $\mathcal{A}$ achieves a rate of convergence $R(\cdot, \cdot, \cdot)$ on an accessible bound on excess error with respect to $\mathbb{C}$, under $\mathbb{D}$ if, for any $\mathcal{D}_{XY} \in \mathbb{D}$, $\delta \in (0, 1)$, and $n \in \mathbb{N}$, if $(h_n, \hat{\epsilon}_n) = \mathcal{A}(n)$ is the algorithm's output after at most $n$ label requests, for target distribution $\mathcal{D}_{XY}$, then

$\mathbb{P}(er(h_n) - \nu(\mathbb{C}, \mathcal{D}_{XY}) \leq \hat{\epsilon}_n \leq R(n, \delta, \mathcal{D}_{XY})) \geq 1 - \delta$.

I will refer to Definition 1.2 as a verifiable rate under $\mathbb{D}$, for short. If ever I simply refer to the rate, I will mean Definition 1.1. To distinguish these two notions of convergence rates, I may sometimes refer to Definition 1.1 as the unverifiable rate or the true rate. Clearly any algorithm that achieves a verifiable rate $R$ also achieves $R$ as an unverifiable rate. However, we will see interesting cases where the reverse is not true.

At times, it will be necessary to express some results in terms of the number of label requests required to guarantee a certain error rate. This quantity is referred to as the label complexity, and is defined quite naturally as follows.

Definition 1.3. (Unverifiable label complexity) An algorithm $\mathcal{A}$ achieves a label complexity $\bar{\Lambda}(\cdot, \cdot)$ for expected error, if for any $\mathcal{D}_{XY}$, $\forall \epsilon \in (0, 1)$, $\forall n \geq \bar{\Lambda}(\epsilon, \mathcal{D}_{XY})$, if $h_n = \mathcal{A}(n)$ is the algorithm's output after at most $n$ label requests, for target distribution $\mathcal{D}_{XY}$, then $\mathbb{E}[er(h_n)] \leq \epsilon$.

An algorithm $\mathcal{A}$ achieves a label complexity $\Lambda(\cdot, \cdot, \cdot)$ for confidence-bounded error, if for any $\mathcal{D}_{XY}$, $\forall \epsilon, \delta \in (0, 1)$, $\forall n \geq \Lambda(\epsilon, \delta, \mathcal{D}_{XY})$, if $h_n = \mathcal{A}(n)$ is the algorithm's output after at most $n$ label requests, for target distribution $\mathcal{D}_{XY}$, then $\mathbb{P}(er(h_n) \leq \epsilon) \geq 1 - \delta$.

Definition 1.4. (Verifiable label complexity) An algorithm $\mathcal{A}$ achieves a verifiable label complexity $\Lambda(\cdot, \cdot, \cdot)$ for $\mathbb{C}$ under $\mathbb{D}$ if it achieves a verifiable rate $R$ with respect to $\mathbb{C}$ under $\mathbb{D}$ such that, for any $\mathcal{D}_{XY} \in \mathbb{D}$, $\forall \delta \in (0, 1)$, $\forall \epsilon \in (0, 1)$, $\forall n \geq \Lambda(\epsilon, \delta, \mathcal{D}_{XY})$, $R(n, \delta, \mathcal{D}_{XY}) \leq \epsilon$.


Again, to distinguish between these definitions, I may sometimes refer to the former as the unverifiable label complexity or the true label complexity. Also, throughout the thesis, I will maintain the convention that whenever I refer to a “rate $R$” or “label complexity $\Lambda$,” I refer to the confidence-bounded variety, and similarly when I refer to a “rate $\bar{R}$” or “label complexity $\bar{\Lambda}$,” in those cases I refer to the version of the definition for expected error rates.
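The relationship between a rate and a label complexity can be made concrete with a short sketch: given a rate that is nonincreasing in $n$, the smallest $n$ at which the rate drops to $\epsilon$ serves as a label complexity, since the rate then stays at or below $\epsilon$ for every larger $n$. The function name `label_complexity`, the search cap `n_max`, and the two example rates (borrowed from the threshold example of Section 1.2, with the dependence on the distribution and on $\delta$ suppressed) are assumptions made for illustration.

```python
def label_complexity(rate, eps, n_max=10**6):
    """Smallest n <= n_max with rate(n) <= eps.

    Assumes rate is nonincreasing in n, so that rate(n') <= eps for all n' >= n.
    Returns None if no such n is found below the cap.
    """
    for n in range(1, n_max + 1):
        if rate(n) <= eps:
            return n
    return None

def passive_rate(n):
    """Illustrative linear rate 1 / (2(n + 1)) from the threshold example."""
    return 1 / (2 * (n + 1))

def active_rate(n):
    """Illustrative exponential rate (1/2)(3/4)^n from the threshold example."""
    return 0.5 * 0.75 ** n

if __name__ == "__main__":
    for eps in (0.1, 0.01, 0.001):
        print(eps, label_complexity(passive_rate, eps), label_complexity(active_rate, eps))
```

For these two illustrative rates, the passive count grows like $1/\epsilon$ while the active count grows like $\log(1/\epsilon)$.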

A brief note on measurability:

Throughout this thesis, we will let $\mathbb{E}$ and $\mathbb{P}$ (and indeed any reference to “probability”) refer to the outer expectation and probability [van der Vaart and Wellner, 1996], so that quantities such as $\mathbb{P}(\mathrm{DIS}(B(h, r)))$ are well defined, even if $\mathrm{DIS}(B(h, r))$ is not measurable.


1.4  A  Simple  Algorithm  Based  on  Disagreement 


One of the earliest, and most elegant, theoretically sound active learning algorithms for the realizable case was provided by Cohn, Atlas, and Ladner [1994]. Under the assumption that there exists a perfect classifier in $\mathbb{C}$, they proposed an algorithm which processes unlabeled examples in sequence, and for each one it determines whether there exists a classifier in $\mathbb{C}$ consistent with all previously observed labels that labels this new example $+1$ and one that labels this example $-1$; if so, the algorithm requests the label, and otherwise it does not request the label; after $n$ label requests, the algorithm returns any classifier consistent with all observed labels. In some sense, this algorithm corresponds to the very least we could expect of an active learning algorithm, as it never requests the label of an example it can derive from known information, but otherwise makes no effort to search for informative examples. We can equivalently think of this algorithm as maintaining two sets: $V \subseteq \mathbb{C}$ is the set of candidate hypotheses still under consideration, and $R = \mathrm{DIS}(V)$ is their region of disagreement. We can then think of the algorithm as requesting a random labeled example from the conditional distribution of $\mathcal{D}_{XY}$ given that $X \in R$, and subsequently removing from $V$ any classifier inconsistent with the observed label.

Most of the active learning algorithms we study in subsequent chapters will be, in some way, variants of, or extensions to, this basic procedure. In fact, at this writing, all of the published general-purpose agnostic active learning algorithms achieving nontrivial improvements are derivatives of Algorithm 0. A formal definition of the algorithm is given below.

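Algorithm 0

Input: hypothesis class $\mathbb{H}$, label budget $n$

Output: classifier $h_n \in \mathbb{H}$ and error bound $\hat{\epsilon}_n$

0. $V_0 \leftarrow \mathbb{H}$, $q \leftarrow 0$

1. For $m = 1, 2, \ldots$

2.   If $\exists h_1, h_2 \in V_q$ s.t. $h_1(x_m) \neq h_2(x_m)$,

3.     Request $y_m$

4.     $q \leftarrow q + 1$

5.     $V_q \leftarrow \{h \in V_{q-1} : h(x_m) = y_m\}$

6.     If $q = n$, Return an arbitrary classifier $h_n \in V_n$ and value $\hat{\epsilon}_n = \mathrm{diam}(V_n)$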
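The steps of Algorithm 0 can be sketched in code for the special case of a finite hypothesis class, where the version space can be stored explicitly. This is only an illustration under stated assumptions: the names `cal` and `make_threshold`, the grid of 101 thresholds, and the random stream are hypothetical, and the sketch returns the whole surviving version space (from which any element may be chosen) rather than computing $\mathrm{diam}(V_n)$, which would require knowledge of the marginal distribution. Unlike the formal statement, it also stops if the unlabeled stream runs out before the budget is spent.

```python
import random

def make_threshold(t):
    """Hypothetical concept: predicts +1 when x >= t, and -1 otherwise."""
    return lambda x: 1 if x >= t else -1

def cal(hypotheses, stream, oracle, budget):
    """Sketch of Algorithm 0 for a finite hypothesis list (realizable case).

    hypotheses: list of callables x -> {-1, +1}
    stream:     iterable of unlabeled points x_1, x_2, ...
    oracle:     callable x -> true label, called only when a label is requested
    budget:     maximum number of label requests n
    Returns (surviving version space, number of label requests made).
    """
    version_space = list(hypotheses)
    requests = 0
    for x in stream:
        if requests >= budget:
            break
        predictions = {h(x) for h in version_space}
        if len(predictions) > 1:
            # x lies in the region of disagreement, so its label is informative.
            y = oracle(x)
            requests += 1
            version_space = [h for h in version_space if h(x) == y]
        # Otherwise all surviving hypotheses agree and the label can be inferred.
    return version_space, requests

if __name__ == "__main__":
    hypotheses = [make_threshold(k / 100) for k in range(101)]
    target = hypotheses[37]
    rng = random.Random(0)
    stream = [rng.random() for _ in range(1000)]
    survivors, used = cal(hypotheses, stream, target, budget=15)
    print("label requests used:", used)
    print("hypotheses remaining:", len(survivors))
```

In the realizable case the target is never eliminated, because a hypothesis is only removed when it contradicts a label produced by the target itself.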

One of the most appealing properties of this algorithm, besides its simplicity, is the fact that it makes extremely efficient use of the unlabeled examples; in fact, supposing the algorithm processes $m$ unlabeled examples before returning, we can take the classifier $h_n$ and label all of the examples we skipped over (i.e., those we did not request the labels of); this actually produces a set of $m$ perfectly labeled examples, which we can feed into our favorite passive learning algorithm, even though we only requested the labels of a subset of those examples. This fact also provides a simple proof that $er(h_n)$ can be bounded by a quantity that decreases to zero (in probability) with $n$: namely, $\mathrm{diam}(V_n)$. However, Cohn et al. [1994] did not provide any further characterization of the rates achieved by this algorithm in general. For this, we must wait until Chapter 2, where I provide the first general characterization of the rates achieved by this method in terms of a quantity I call the disagreement coefficient.


1.5  A  Lower  Bound 


When beginning an investigation into the achievable rates, it is natural to first ask what we can possibly hope to achieve, and what results are definitely not possible. That is, what are some fundamental limits on what this type of learning is capable of? This type of question was investigated by Kulkarni et al. [1993] in a more general setting. Informally, the reasoning is that each label request can communicate at most one bit of information. So the best we can hope for is something logarithmic in the “size” of the hypothesis class. Of course, for infinite hypothesis classes this makes no sense, but with the help of a notion of cover size, Kulkarni et al. [1993] were able to prove the analogous result.


Specifically, let $N(\epsilon)$ be the size of the smallest set $V$ of classifiers in $\mathbb{C}$ such that $\forall h \in \mathbb{C}$, $\exists h' \in V : \mathbb{P}_{X \sim \mathcal{D}}[h(X) \neq h'(X)] \leq \epsilon$, for some distribution $\mathcal{D}$ over $\mathcal{X}$. Then any achievable label complexity $\Lambda$ has the property that $\forall \epsilon > 0$,

$\sup_{\mathcal{D}_{XY} \in \mathrm{Realizable}(\mathbb{C}, \mathcal{D})} \Lambda(\epsilon, \delta, \mathcal{D}_{XY}) \geq \log_2[(1 - \delta) N(2\epsilon)]$.

Since we can often get a reasonable estimate of $N(\epsilon)$ by its distribution-free upper bound $2\left(\frac{2e}{\epsilon} \ln \frac{2e}{\epsilon}\right)^d$ [Haussler, 1992], we can often expect our rates to be at best $\exp\{-cn/d\}$ for some constant $c$. In particular, rather than working with $N(\epsilon)$ in the results below, I will typically formulate upper bounds in terms of $d$; in most of these cases, some variant of $\log N(\epsilon)$ could easily be substituted to achieve a tighter bound (by using the cover as a hypothesis class instead of the full space), closer in spirit to this lower bound.
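To see where the $\exp\{-cn/d\}$ form comes from, here is a short worked step under an added assumption that is not part of the statement above: suppose the cover size grows polynomially, $N(\epsilon) \geq (c_0/\epsilon)^d$ for some constant $c_0 > 0$ (hypothetical, introduced only for illustration). Substituting into the lower bound and solving for $\epsilon$ in terms of the number of labels $n$ gives the following.

```latex
\begin{align*}
n \;\geq\; \log_2\!\big[(1-\delta)\,N(2\epsilon)\big]
  &\;\geq\; \log_2(1-\delta) + d \log_2\frac{c_0}{2\epsilon} \\
\Longrightarrow\quad
\frac{c_0}{2\epsilon} &\;\leq\; 2^{\,(n - \log_2(1-\delta))/d} \\
\Longrightarrow\quad
\epsilon &\;\geq\; \frac{c_0}{2}\,(1-\delta)^{1/d}\, 2^{-n/d}
  \;=\; \frac{c_0}{2}\,(1-\delta)^{1/d} \exp\{-(\ln 2)\, n/d\}.
\end{align*}
```

Under that assumption, the guaranteed error cannot decay faster than exponentially in $n/d$, which matches the $\exp\{-cn/d\}$ form with $c = \ln 2$ up to constant factors.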
1.6 Splitting Index


Over the past decade, several special-purpose active learning algorithms were proposed, but notably lacking was a general theory of convergence rates for active learning. This changed in 2005 when Dasgupta published his theory of splitting indices [Dasgupta, 2005].


As before, this section is restricted to the realizable case. Let $Q \subseteq \{\{h_1, h_2\} : h_1, h_2 \in \mathbb{C}\}$ be a finite set of unordered pairs of classifiers from $\mathbb{C}$. For $x \in \mathcal{X}$ and $y \in \{-1, +1\}$, define $Q_x^y = \{\{h_1, h_2\} \in Q : h_1(x) = h_2(x) = y\}$. A point $x \in \mathcal{X}$ is said to $\rho$-split $Q$ if

$\max_{y \in \{-1, +1\}} |Q_x^y| \leq (1 - \rho)|Q|$.

We say $\mathbb{H} \subseteq \mathbb{C}$ is $(\rho, \Delta, \tau)$-splittable if for all finite $Q \subseteq \{\{h_1, h_2\} \subseteq \mathbb{H} : \mathbb{P}(h_1(X) \neq h_2(X)) > \Delta\}$,

$\mathbb{P}(X \text{ } \rho\text{-splits } Q) \geq \tau$.

A large value of $\rho$ for a reasonably large $\tau$ indicates that there are highly informative examples that are not too rare. Dasgupta effectively proves the following results.
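The $\rho$-split condition is easy to evaluate for a finite set of pairs. The sketch below computes, for a given point $x$, the largest $\rho$ for which $x$ $\rho$-splits $Q$, namely $1 - \max_y |Q_x^y| / |Q|$. The function names and the four threshold classifiers are hypothetical and serve only as an illustration.

```python
from itertools import combinations

def make_threshold(t):
    """Hypothetical concept: predicts +1 when x >= t, and -1 otherwise."""
    return lambda x: 1 if x >= t else -1

def split_fraction(pairs, x):
    """Largest rho such that x rho-splits the finite set of pairs.

    pairs: list of (h1, h2) tuples of callables x -> {-1, +1}
    Returns 1 - max_y |Q_x^y| / |Q|.
    """
    counts = {-1: 0, 1: 0}
    for h1, h2 in pairs:
        a, b = h1(x), h2(x)
        if a == b:
            # The pair stays unresolved if the label at x turns out to be a.
            counts[a] += 1
    return 1 - max(counts.values()) / len(pairs)

def rho_splits(pairs, x, rho):
    """True when max_y |Q_x^y| <= (1 - rho) |Q|."""
    return split_fraction(pairs, x) >= rho

if __name__ == "__main__":
    concepts = [make_threshold(t) for t in (0.2, 0.4, 0.6, 0.8)]
    pairs = list(combinations(concepts, 2))
    for x in (0.1, 0.3, 0.5):
        print(x, split_fraction(pairs, x))
```

A point in the middle of the thresholds separates most pairs whichever label is revealed, while a point outside their range separates none.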

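Theorem 1.5. For any VC class $\mathbb{C}$, for some universal constant $c > 0$, there is an algorithm with verifiable label complexity $\Lambda$ for Realizable$(\mathbb{C})$ such that, for any $\epsilon \in (0, 1)$, $\delta \in (0, 1)$, and $\mathcal{D}_{XY} \in$ Realizable$(\mathbb{C})$, if $B(h^*, 4\Delta)$ is $(\rho, \Delta, \tau)$-splittable for all $\Delta \geq \epsilon/2$, then

$\Lambda(\epsilon, \delta, \mathcal{D}_{XY}) \leq c \frac{d}{\rho} \log \frac{d}{\epsilon \delta \tau \rho} \log \frac{1}{\epsilon}$.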


The value $\rho$ has been referred to as the splitting index. It can be useful for quantifying the verifiable rates for a variety of problems in the realizable case. For example, Dasgupta [2005] uses it to analyze the problem where $\mathbb{C}$ is the class of homogeneous linear separators in $d$ dimensions, and $\mathcal{D}_{XY}[X] = \mathcal{D}$ is the uniform distribution on the unit $d$-dimensional sphere. He shows that this problem is $(1/2, \epsilon, \epsilon)$-splittable for any $\epsilon > 0$ for any target in $\mathbb{C}$. This implies a verifiable rate for Realizable$(\mathbb{C}, \mathcal{D})$ of

$R(n, \delta, \mathcal{D}_{XY}) \propto \frac{d}{\delta} \cdot \exp\left\{-c' \sqrt{n/d}\right\}$

for a constant $c' > 0$. This rate was previously known for other algorithms [e.g., Dasgupta et al., 2005], but had not previously been derived as a special case of such a general analysis.
1.7 Agnostic Active Learning


Though each of the preceding analyses provides valuable insights into the nature of active learning, they also suffer the drawback of reliance on the realizability assumption. In particular, that there is no label noise, and that the Bayes optimal classifier is in $\mathbb{C}$, are severe and often unrealistic assumptions. We would ideally like an analysis of the agnostic case as well. However, the aforementioned algorithms (e.g., CAL, and the algorithm achieving the splitting index bounds) no longer function properly in the presence of nonzero noise rates. So we need to start from the basics and build new techniques that are robust to noise conditions.

To begin, we may again ask what we might hope to achieve. That is, are there fundamental information-theoretic limits on what we can do with this type of learning? This question was investigated by Kääriäinen [2006]. In particular, he was able to prove that for basically any nontrivial marginal $\mathcal{D}$ over $\mathcal{X}$, noise rate $\nu$, number $n$, and active learning algorithm, there is some distribution $\mathcal{D}_{XY}$ with marginal $\mathcal{D}$ and noise rate $\nu$ such that the algorithm's achieved rate $R(n, \delta, \mathcal{D}_{XY})$ at $n$ satisfies (for some constant $c > 0$)

$R(n, \delta, \mathcal{D}_{XY}) \geq c \sqrt{\frac{\nu^2 \log(1/\delta)}{n}}$.

Furthermore, this result was improved by Beygelzimer, Dasgupta, and Langford [2009] to

$R(n, 3/4, \mathcal{D}_{XY}) \geq c \sqrt{\frac{\nu^2 d}{n}}$.

Considering that rates $\propto \sqrt{\frac{\nu d \log(1/\delta)}{n}}$ are achievable in passive learning, this indicates that, even for concept spaces that had exponential rates in the realizable case, any bound on the verifiable rates that shows significant improvement (more than a multiplicative factor of $\sqrt{\nu}$) in the dependence on $n$ for nonzero noise rates must depend on $\mathcal{D}_{XY}$ in more than simply the noise rate.
Chapter 2


Rates of Convergence in Active Learning


In this chapter, we study the rates of convergence in generalization error achievable by active learning under various types of label noise. Additionally, we study the more general problem of active learning with a nested hierarchy of hypothesis classes, and propose an algorithm whose error rate provably converges to the best achievable error among classifiers in the hierarchy at a rate adaptive to both the complexity of the optimal classifier and the noise conditions. In particular, we state sufficient conditions for these rates to be dramatically faster than those achievable by passive learning.


2.1  Introduction 


There have recently been a series of exciting advances on the topic of active learning with arbitrary classification noise (the so-called agnostic PAC model), resulting in several new algorithms capable of achieving improved convergence rates compared to passive learning under certain conditions. The first, proposed by Balcan, Beygelzimer, and Langford [2006], was the $A^2$ (agnostic active) algorithm, which is provably never significantly worse than passive learning by empirical risk minimization. This algorithm was later analyzed in more detail in [Hanneke, 2007b], where it was found that a complexity measure called the disagreement coefficient characterizes the worst-case convergence rates achieved by $A^2$ for any given hypothesis class, data distribution, and best achievable error rate in the class. The next major advance was by Dasgupta, Hsu, and Monteleoni [2007], who proposed a new algorithm, and proved that it improves the dependence of the convergence rates on the disagreement coefficient compared to $A^2$. Both algorithms are defined below in Section 2.2. While all of these advances are encouraging, they are limited in two ways. First, the convergence rates that have been proven for these algorithms typically only improve the dependence on the magnitude of the noise (more precisely, the noise rate of the hypothesis class), compared to passive learning. Thus, in an asymptotic sense, for nonzero noise rates these results represent at best a constant factor improvement over passive learning. Second, these results are limited to learning with a fixed hypothesis class of limited expressiveness, so that convergence to the Bayes error rate is not always a possibility.


On the first of these limitations, some recent work by Castro and Nowak [2006] on learning threshold classifiers discovered that if certain parameters of the noise distribution are known (namely, parameters related to Tsybakov’s margin conditions), then we can achieve strict improvements in the asymptotic convergence rate via a specific active learning algorithm designed to take advantage of that knowledge for thresholds. That work left open the question of whether such improvements could be achieved by an algorithm that does not explicitly depend on the noise conditions (i.e., in the agnostic setting), and whether this type of improvement is achievable for more general families of hypothesis classes. In a personal communication, John Langford reported that he and Rui Castro determined such improvements are in fact achieved by A2 for the special case of threshold classifiers. However, there remained an open question of whether such rate improvements could be generalized to hold for arbitrary hypothesis classes. In Section 2.3 we provide precisely this generalization. We analyze the rates achieved by A2 under Tsybakov’s noise conditions [Mammen and Tsybakov, 1999, Tsybakov, 2004]; in particular, we find that these rates are strictly superior to the known rates for passive learning, when the disagreement coefficient is small. We also study a novel modification of the algorithm of Dasgupta, Hsu, and Monteleoni [2007], proving that it improves upon the rates of A2 in its dependence on the disagreement coefficient.

Additionally, in Section 2.4 we address the second limitation by proposing a general model selection procedure for active learning with an arbitrary structure of nested hypothesis classes. If the classes each have finite complexity, the error rate for this algorithm converges to the best achievable error by any classifier in the structure, at a rate that adapts to the noise conditions and complexity of the optimal classifier. In general, if the structure is constructed to include arbitrarily good approximations to any classifier, the error converges to the Bayes error rate in the limit. In particular, if the Bayes optimal classifier is in some class within the structure, the algorithm performs nearly as well as running an agnostic active learning algorithm on that single hypothesis class, thus preserving the convergence rate improvements achievable for that class.


2.1.1  Tsybakov’s  Noise  Conditions 

In this chapter, we will primarily be interested in the sets $\mathrm{Tsybakov}(C, \kappa, \mu)$, for parameter values $\mu > 0$ and $\kappa \ge 1$. These noise conditions have recently received substantial attention in the passive learning literature, as they describe situations in which the asymptotic minimax convergence rate of passive learning is faster than the worst case $n^{-1/2}$ rate [e.g., Castro and Nowak, 2006, Mammen and Tsybakov, 1999, Massart and Elodie Nedelec, 2006, Koltchinskii, 2006, Tsybakov, 2004]. This condition is satisfied when, for example,

\[ \exists \mu' > 0,\ \kappa \ge 1 \ \text{s.t.}\ \exists h \in C : \forall h' \in C,\quad \mathrm{er}(h') - \nu \ge \mu' \, \mathbb{P}\{h(X) \ne h'(X)\}^{\kappa}. \]


As we will see, the case where $\kappa = 1$ is particularly interesting; for instance, this is the case when $h^* \in C$ and $\mathbb{P}\{|\eta(X) - 1/2| > c\} = 1$ for some constant $c \in (0, 1/2)$. Informally, these conditions can often be interpreted in terms of the relation between magnitude of noise and distance to the decision boundary; that is, since in practice the amount of noise in an example’s label is often inversely related to the distance from the decision boundary, a $\kappa$ value of 1 may often result from having low density near the decision boundary (i.e., large margin); when this is not the case, the value of $\kappa$ is essentially determined by how quickly $\eta(x)$ changes as $x$ approaches the decision boundary. See [Castro and Nowak, 2006, Mammen and Tsybakov, 1999, Massart and Elodie Nedelec, 2006, Koltchinskii, 2006, Tsybakov, 2004] for further interpretations of this margin condition.

It is known that when these conditions are satisfied for some $\kappa \ge 1$ and $\mu > 0$, the passive learning method of empirical risk minimization achieves a convergence rate guarantee, holding with probability $\ge 1 - \delta$, of

\[ \mathrm{er}\Big(\arg\min_{h \in C} \mathrm{er}_n(h)\Big) - \nu \le c \left( \frac{d \log(n/\delta)}{n} \right)^{\frac{\kappa}{2\kappa - 1}}, \]

where $c$ is a ($\kappa$ and $\mu$-dependent) constant [Koltchinskii, 2006, Mammen and Tsybakov, 1999, Massart and Elodie Nedelec, 2006]. Furthermore, for some hypothesis classes, this is known to be a tight bound (up to the log factor) on the minimax convergence rate, so that there is no passive learning algorithm for these classes for which we can guarantee a faster convergence rate, given that the guarantee depends on $\mathcal{D}_{XY}$ only through $\mu$ and $\kappa$ [Tsybakov, 2004].


2.1.2  Disagreement  Coefficient 

Central to the idea of Algorithm 0, and the various generalizations thereof we will study, is the idea of the region of disagreement of the version space. Thus, a quantification of the performance of these algorithms should hinge upon a description of how quickly the region of disagreement collapses as the algorithm processes examples. This rate of collapse is precisely captured by a notion introduced in [Hanneke, 2007b], called the disagreement coefficient. It is a measure of the complexity of an active learning problem, which has proven quite useful for analyzing the algorithms of Balcan, Beygelzimer, and Langford [2006], Beygelzimer, Dasgupta, and Langford [2009], Cohn, Atlas, and Ladner [1994], and Dasgupta, Hsu, and Monteleoni [2007]. Informally, it quantifies how much disagreement there is among a set of classifiers relative to how close to some $h$ they are. The following is a version of its definition, which we will use extensively below.


Definition 2.1. The disagreement coefficient of $h$ with respect to $C$ under $\mathcal{D}_{XY}[X]$ is

\[ \theta_h = \sup_{r > r_0} \frac{\mathbb{P}(\mathrm{DIS}(B(h, r)))}{r}, \]

where $r_0$ can either be defined as 0, giving a coarse analysis, or for a more subtle analysis we can take it to be a function of $n$, the number of labels (see Section 2.7.1 for such a definition valid for the main theorems of this chapter: 2.11-2.15).

We further define the disagreement coefficient for the hypothesis class $C$ with respect to the target distribution $\mathcal{D}_{XY}$ as $\theta = \limsup_{k \to \infty} \theta_{h^{(k)}}$, where $\{h^{(k)}\}$ is any sequence of $h^{(k)} \in C$ with $\mathrm{er}(h^{(k)})$ monotonically decreasing to $\nu$.


In particular, we can always bound the disagreement coefficient by $\sup_{h \in C} \theta_h \ge \theta$.

Because of its simple intuitive interpretation, measuring the amount of disagreement in a local neighborhood of some classifier $h$, the disagreement coefficient has the wonderful property of being relatively simple to calculate for a wide range of learning problems, especially when those problems have some type of geometric representation. To illustrate this, we will go through a few simple examples, taken from [Hanneke, 2007b].


Consider the hypothesis class of thresholds $h_z$ on the interval $[0, 1]$ (for $z \in [0, 1]$), where $h_z(x) = +1$ iff $x \ge z$. Furthermore, suppose $\mathcal{D}_{XY}[X]$ is uniform on $[0, 1]$. In this case, it is clear that the disagreement coefficient is at most 2, since the region of disagreement of $B(h_z, r)$ is roughly $\{x \in [0, 1] : |x - z| \le r\}$. That is, since the disagreement region grows at rate 1 in two disjoint directions as $r$ increases, the disagreement coefficient $\theta_{h_z} = 2$ for any $z \in (0, 1)$.
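The threshold example can be checked numerically. The following is a minimal sketch, assuming the uniform marginal on [0, 1] and a small positive r0; the function names and the grid over radii are illustrative choices, not part of the original analysis.

```python
def disagreement_mass(z: float, r: float) -> float:
    """P(DIS(B(h_z, r))) for thresholds on [0, 1] under the uniform marginal."""
    # B(h_z, r) holds the thresholds h_{z'} with |z - z'| <= r, and two
    # thresholds disagree exactly on the points lying between them.
    lo = max(0.0, z - r)
    hi = min(1.0, z + r)
    return hi - lo


def disagreement_coefficient(z: float, r0: float = 1e-6, steps: int = 2000) -> float:
    """Approximate the sup over r > r0 of P(DIS(B(h_z, r))) / r on a grid of radii."""
    radii = [r0 + (1.0 - r0) * (i + 1) / steps for i in range(steps)]
    return max(disagreement_mass(z, r) / r for r in radii)
```

For an interior threshold such as z = 0.5 the ratio stays at 2 until the ball reaches the boundary of [0, 1] and then decays, which matches the value stated above. At the endpoint z = 0 the region can grow in only one direction, so the sketch returns 1.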

As a second example, consider the disagreement coefficient for intervals on $[0, 1]$. As before, let $\mathcal{X} = [0, 1]$ and $\mathcal{D}_{XY}[X]$ be uniform, but this time $C$ is the set of intervals $I_{[a,b]}$ such that for $x \in [0, 1]$, $I_{[a,b]}(x) = +1$ iff $x \in [a, b]$ (for $a, b \in [0, 1]$, $a \le b$). In contrast to thresholds, the disagreement coefficients $\theta_h$ for the space of intervals vary widely depending on the particular $h$. In particular, take any $h = I_{[a,b]}$ where $0 < a \le b < 1$. In this case, $\theta_h \le \max\left\{ \frac{1}{\max\{r_0, b - a\}}, 4 \right\}$. To see this, note that when $r_0 < r < b - a$, every interval in $B(I_{[a,b]}, r)$ has its lower and upper boundaries within $r$ of $a$ and $b$, respectively; thus, $\mathbb{P}(\mathrm{DIS}(B(I_{[a,b]}, r))) \le 4r$. However, when $r \ge \max\{r_0, b - a\}$, every interval of width $\le r - (b - a)$ is in $B(I_{[a,b]}, r)$, so $\mathbb{P}(\mathrm{DIS}(B(I_{[a,b]}, r))) = 1$.
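The two regimes in the interval example can be seen by brute force on a discretized problem. The sketch below is an illustration under assumptions that are not in the original: endpoints are restricted to a grid of 51 integer positions (units of 1/50), the marginal is approximated by one sample point per grid cell, and the function name is hypothetical.

```python
def interval_disagreement_mass(a: int, b: int, r: int, grid: int = 50) -> float:
    """Estimate P(DIS(B(I_[a,b], r))), with a, b and r given in units of 1/grid."""
    def contains(lo: int, hi: int, x: float) -> bool:
        return lo <= x <= hi

    points = [j + 0.5 for j in range(grid)]  # one sample point per grid cell
    in_dis = [False] * grid
    for lo in range(grid + 1):
        for hi in range(lo, grid + 1):
            overlap = max(0, min(b, hi) - max(a, lo))
            # Distance between classifiers = mass of the symmetric difference.
            distance = (b - a) + (hi - lo) - 2 * overlap
            if distance > r:
                continue  # the interval [lo, hi] lies outside the ball of radius r
            for j, x in enumerate(points):
                if contains(a, b, x) != contains(lo, hi, x):
                    in_dis[j] = True
    return sum(in_dis) / grid
```

With a = 15 and b = 35 (the interval [0.3, 0.7]), a radius of 5 units (r = 0.1 < b - a) gives a disagreement mass of 0.4 = 4r, while a radius of 25 units (r = 0.5 > b - a) gives mass 1, since short intervals placed anywhere already lie inside the ball.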


As  a  slightly  more  involved  example,  consider  the  following  theorem. 


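Theorem 2.2. [Hanneke, 2007b] If $\mathcal{X}$ is the surface of the origin-centered unit sphere in $\mathbb{R}^d$ for $d > 2$, $C$ is the space of linear separators whose decision surface passes through the origin, and $\mathcal{D}_{XY}[X]$ is the uniform distribution on $\mathcal{X}$, then $\forall h \in C$ the disagreement coefficient $\theta_h$ satisfies

\[ \frac{1}{4} \min\left\{ \pi\sqrt{d}, \frac{1}{r_0} \right\} \le \theta_h \le \min\left\{ \pi\sqrt{d}, \frac{1}{r_0} \right\}. \]

Proof. First we represent the concepts in $C$ as weight vectors $w \in \mathbb{R}^d$ in the usual way. For $w_1, w_2 \in C$, by examining the projection of $\mathcal{D}_{XY}[X]$ onto the subspace spanned by $\{w_1, w_2\}$, we see that $\mathbb{P}(x : \mathrm{sign}(w_1 \cdot x) \ne \mathrm{sign}(w_2 \cdot x)) = \frac{\arccos(w_1 \cdot w_2)}{\pi}$. Thus, for any $w \in C$ and $r \le 1/2$, $B(w, r) = \{w' : w \cdot w' \ge \cos(\pi r)\}$. Since the decision boundary corresponding to $w'$ is orthogonal to the vector $w'$, some simple trigonometry gives us that

\[ \mathrm{DIS}(B(w, r)) = \{x \in \mathcal{X} : |x \cdot w| \le \sin(\pi r)\}. \]

Letting $A(d, R) = \frac{2\pi^{d/2} R^{d-1}}{\Gamma(d/2)}$ denote the surface area of the radius-$R$ sphere in $\mathbb{R}^d$, we can express the disagreement rate at radius $r$ as

\[ \mathbb{P}(\mathrm{DIS}(B(w, r))) = \frac{1}{A(d, 1)} \int_{-\sin(\pi r)}^{\sin(\pi r)} A\left(d - 1, \sqrt{1 - x^2}\right) dx = \frac{\Gamma\left(\frac{d}{2}\right)}{\sqrt{\pi}\, \Gamma\left(\frac{d-1}{2}\right)} \int_{-\sin(\pi r)}^{\sin(\pi r)} \left(1 - x^2\right)^{\frac{d-2}{2}} dx \quad (*) \]

\[ \le \frac{\Gamma\left(\frac{d}{2}\right)}{\sqrt{\pi}\, \Gamma\left(\frac{d-1}{2}\right)} \, 2\sin(\pi r) \le \sqrt{d - 2}\, \sin(\pi r) \le \sqrt{d}\, \pi r. \]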


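For the lower bound, note that $\mathbb{P}(\mathrm{DIS}(B(w, 1/2))) = 1$ so $\theta_w \ge \min\left\{2, \frac{1}{r_0}\right\}$, and thus we need only consider $r_0 < \frac{1}{8}$. Supposing $r_0 < r < \frac{1}{8}$, note that (*) is at least

\[ \sqrt{\frac{d}{12}} \int_{-\sin(\pi r)}^{\sin(\pi r)} \left(1 - x^2\right)^{\frac{d}{2}} dx \ge \sqrt{\frac{d}{12}} \int_{-\sin(\pi r)}^{\sin(\pi r)} e^{-d x^2} dx \ge \frac{1}{2} \min\left\{ \frac{1}{2}, \sqrt{d}\, \sin(\pi r) \right\} \ge \frac{1}{4} \min\left\{ 1, \pi\sqrt{d}\, r \right\}. \]

□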


The  disagreement  coefficient  has  many  interesting  properties  that  can  help  to  bound  its  value 
for  a  given  hypothesis  class  and  distribution.  We  list  a  few  elementary  properties  below.  Their 
proofs,  which  are  quite  short  and  follow  directly  from  the  definition,  are  left  as  easy  exercises. 


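Lemma 2.3. [Close Marginals] [Hanneke, 2007b] Suppose $\exists \lambda \in (0, 1]$ s.t. for any measurable set $A \subseteq \mathcal{X}$, $\lambda \mathbb{P}_{\mathcal{D}_X}(A) \le \mathbb{P}_{\mathcal{D}'_X}(A) \le \frac{1}{\lambda} \mathbb{P}_{\mathcal{D}_X}(A)$. Let $h : \mathcal{X} \to \{-1, 1\}$ be a measurable classifier, and suppose $\theta_h$ and $\theta'_h$ are the disagreement coefficients for $h$ with respect to $C$ under $\mathcal{D}_X$ and $\mathcal{D}'_X$ respectively. Then

\[ \lambda^2 \theta_h \le \theta'_h \le \frac{1}{\lambda^2} \theta_h. \]

Lemma 2.4. [Finite Mixtures] Suppose $\exists \alpha \in [0, 1]$ s.t. for any measurable set $A \subseteq \mathcal{X}$, $\mathbb{P}_{\mathcal{D}_X}(A) = \alpha \mathbb{P}_{\mathcal{D}_1}(A) + (1 - \alpha) \mathbb{P}_{\mathcal{D}_2}(A)$. For a measurable $h : \mathcal{X} \to \{-1, 1\}$, let $\theta_h^{(1)}$ be the disagreement coefficient with respect to $C$ under $\mathcal{D}_1$, $\theta_h^{(2)}$ be the disagreement coefficient with respect to $C$ under $\mathcal{D}_2$, and $\theta_h$ be the disagreement coefficient with respect to $C$ under $\mathcal{D}_X$. Then

\[ \theta_h \le \theta_h^{(1)} + \theta_h^{(2)}. \]

Lemma 2.5. [Finite Unions] Suppose $h \in C_1 \cap C_2$ is a classifier s.t. the disagreement coefficient with respect to $C_1$ under $\mathcal{D}_X$ is $\theta_h^{(1)}$ and with respect to $C_2$ under $\mathcal{D}_X$ is $\theta_h^{(2)}$. Then if $\theta_h$ is the disagreement coefficient with respect to $C = C_1 \cup C_2$ under $\mathcal{D}_X$, we have that

\[ \max\left\{ \theta_h^{(1)}, \theta_h^{(2)} \right\} \le \theta_h \le \theta_h^{(1)} + \theta_h^{(2)}. \]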


The disagreement coefficient has deep connections to several other quantities, such as doubling dimension [Li and Long, 2007] and VC dimension [Vapnik, 1982]. See [Hanneke, 2007b], [Dasgupta, Hsu, and Monteleoni, 2007], [Balcan, Hanneke, and Wortman, 2008], and [Beygelzimer, Dasgupta, and Langford, 2009] for further discussions of various uses of the disagreement coefficient and related notions and extensions in active learning. In particular, Beygelzimer, Dasgupta, and Langford [2009] present an interesting analysis using a natural extension of the disagreement coefficient to study active learning with a larger family of loss functions beyond 0-1 loss. As a related aside, although the focus of this thesis is active learning, interestingly the disagreement coefficient also has applications in the analysis of passive learning; see Section 2.9 for an interesting example of this.


2.2  General  Algorithms 

The  algorithms  described  below  for  the  problem  of  active  learning  with  label  noise  each  represent 
noise-robust  variants  of  Algorithm  0.  They  work  to  reduce  the  set  of  candidate  hypotheses,  while 
only  requesting  the  labels  of  examples  in  the  region  of  disagreement  of  these  candidates.  The 
trick  is  to  only  remove  a  classifier  from  the  candidate  set  once  we  have  high  statistical  confidence 
that  it  is  worse  than  some  other  candidate  classifier  so  that  we  never  remove  the  best  classifier. 
However,  the  two  algorithms  differ  somewhat  in  the  details  of  how  that  confidence  is  calculated. 


2.2.1  Algorithm  1 


The first algorithm, originally proposed by Balcan, Beygelzimer, and Langford [2006], is typically referred to as A2 for Agnostic Active. This was historically the first general-purpose agnostic active learning algorithm shown to achieve improved error guarantees for certain learning problems in certain ranges of $n$ and $\nu$. A version of the algorithm is described below.

Algorithm 1

Input: hypothesis class $C$, label budget $n$, confidence $\delta$
Output: classifier $\hat{h}_n$

0. $V \leftarrow C$, $R \leftarrow \mathrm{DIS}(C)$, $Q \leftarrow \emptyset$, $m \leftarrow 0$
1. For $t = 1, 2, \ldots, n$
2.   If $\mathbb{P}(\mathrm{DIS}(V)) \le \frac{1}{2} \mathbb{P}(R)$
3.     $R \leftarrow \mathrm{DIS}(V)$; $Q \leftarrow \emptyset$
4.   If $\mathbb{P}(R) \le 2^{-n}$, Return any $h \in V$
5.   $m \leftarrow \min\{m' > m : X_{m'} \in R\}$
6.   Request $Y_m$ and let $Q \leftarrow Q \cup \{(X_m, Y_m)\}$
7.   $V \leftarrow \{h \in V : \mathrm{LB}(h, Q, \delta/n) \le \min_{h' \in V} \mathrm{UB}(h', Q, \delta/n)\}$
8.   $h_t \leftarrow \arg\min_{h \in V} \mathrm{UB}(h, Q, \delta/n)$
9.   $\beta_t \leftarrow \left( \mathrm{UB}(h_t, Q, \delta/n) - \min_{h \in V} \mathrm{LB}(h, Q, \delta/n) \right) \mathbb{P}(R)$
10. Return $\hat{h}_n = h_{\hat{t}}$, where $\hat{t} = \arg\min_{t \in \{1, 2, \ldots, n\}} \beta_t$

Algorithm 1 is defined in terms of two functions: UB and LB. These represent upper and lower confidence bounds on the error rate of a classifier from $C$ with respect to an arbitrary sampling distribution, as a function of a labeled sequence sampled according to that distribution. As long as these bounds satisfy

\[ \mathbb{P}_{Z \sim \mathcal{D}^m}\{\forall h \in C,\ \mathrm{LB}(h, Z, \delta) \le \mathrm{er}_{\mathcal{D}}(h) \le \mathrm{UB}(h, Z, \delta)\} \ge 1 - \delta \]

for any distribution $\mathcal{D}$ over $\mathcal{X} \times \{-1, 1\}$ and any $\delta \in (0, 1/2)$, and UB and LB converge to each other as $m$ grows, this algorithm is known to be correct, in that $\mathrm{er}(\hat{h}_n) - \nu$ converges to 0 in probability [Balcan, Beygelzimer, and Langford, 2006]. For instance, Balcan, Beygelzimer, and Langford suggest defining these functions based on classic results on uniform convergence rates in passive learning [Vapnik, 1982], such as

\[ \mathrm{UB}(h, Q, \delta) = \min\{\mathrm{er}_Q(h) + G(|Q|, \delta), 1\}, \qquad \mathrm{LB}(h, Q, \delta) = \max\{\mathrm{er}_Q(h) - G(|Q|, \delta), 0\}, \qquad (2.1) \]

where $G(m, \delta) = \frac{1}{m} + \sqrt{\frac{\ln\frac{4}{\delta} + d \ln\frac{2em}{d}}{m}}$ and by convention $G(0, \delta) = \infty$. This choice is justified by the following lemma, due to Vapnik [1998].

Lemma 2.6. For any distribution $\mathcal{D}$ over $\mathcal{X} \times \{-1, 1\}$, and any $\delta > 0$ and $m \in \mathbb{N}$, with probability $\ge 1 - \delta$ over the draw of $Z \sim \mathcal{D}^m$, every $h \in C$ satisfies

\[ |\mathrm{er}_Z(h) - \mathrm{er}_{\mathcal{D}}(h)| \le G(m, \delta). \qquad (2.2) \]

To avoid computational issues, instead of explicitly representing the sets $V$ and $R$, we may implicitly represent them as a set of constraints imposed by the condition in Step 7 of previous iterations. We may also replace $\mathbb{P}(\mathrm{DIS}(V))$ and $\mathbb{P}(R)$ by estimates, since these quantities can be estimated to arbitrary precision with arbitrarily high confidence using only unlabeled examples.
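The confidence bounds in (2.1) are simple to compute once the empirical error on Q is known. The following is a minimal sketch rather than a reference implementation: classifiers are assumed to be Python callables returning -1 or +1, Q is a list of (x, y) pairs, the VC dimension d is passed explicitly, and the helper `empirical_error` is a name introduced here for illustration.

```python
import math


def G(m: int, delta: float, d: int) -> float:
    """Deviation term used in (2.1); infinite by convention when m = 0.

    Assumes m >= d, so that the logarithm ln(2em/d) is positive.
    """
    if m == 0:
        return math.inf
    return 1.0 / m + math.sqrt(
        (math.log(4.0 / delta) + d * math.log(2.0 * math.e * m / d)) / m
    )


def empirical_error(h, Q) -> float:
    """Fraction of labeled pairs (x, y) in Q that h misclassifies (0 if Q is empty)."""
    return sum(1 for x, y in Q if h(x) != y) / len(Q) if Q else 0.0


def UB(h, Q, delta: float, d: int) -> float:
    """Upper confidence bound on the error rate, clipped to 1."""
    return min(empirical_error(h, Q) + G(len(Q), delta, d), 1.0)


def LB(h, Q, delta: float, d: int) -> float:
    """Lower confidence bound on the error rate, clipped to 0."""
    return max(empirical_error(h, Q) - G(len(Q), delta, d), 0.0)
```

With an empty Q the bounds are the trivial [0, 1], which is why Step 3 of Algorithm 1 effectively restarts the confidence intervals whenever Q is reset. In the algorithm itself these functions are called with confidence δ/n.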


2.2.2  Algorithm  2 


The second algorithm we study was originally proposed by Dasgupta, Hsu, and Monteleoni [2007]. It uses a type of constrained passive learning subroutine, LEARN, defined as follows.

\[ \mathrm{LEARN}_C(\mathcal{L}, Q) = \arg\min_{h \in C : \mathrm{er}_{\mathcal{L}}(h) = 0} \mathrm{er}_Q(h). \]

By convention, if no $h \in C$ has $\mathrm{er}_{\mathcal{L}}(h) = 0$, $\mathrm{LEARN}_C(\mathcal{L}, Q) = \emptyset$.

Algorithm 2

Input: hypothesis class $C$, label budget $n$, confidence $\delta$
Output: classifier $\hat{h}$, set of labeled examples $\mathcal{L}$, set of labeled examples $Q$

0. $\mathcal{L} \leftarrow \emptyset$, $Q \leftarrow \emptyset$
1. For $m = 1, 2, \ldots$
2.   If $|Q| = n$ or $|\mathcal{L}| = 2^n$, Return $\hat{h} = \mathrm{LEARN}_C(\mathcal{L}, Q)$ along with $\mathcal{L}$ and $Q$
3.   For each $y \in \{-1, +1\}$, let $h^{(y)} = \mathrm{LEARN}_C(\mathcal{L} \cup \{(X_m, y)\}, Q)$
4.   If some $y$ has $h^{(-y)} = \emptyset$ or $\mathrm{er}_{\mathcal{L} \cup Q}(h^{(-y)}) - \mathrm{er}_{\mathcal{L} \cup Q}(h^{(y)}) > \Delta_{m-1}(\mathcal{L}, Q, h^{(y)}, h^{(-y)}, \delta)$
5.   Then $\mathcal{L} \leftarrow \mathcal{L} \cup \{(X_m, y)\}$
6.   Else Request the label $Y_m$ and let $Q \leftarrow Q \cup \{(X_m, Y_m)\}$

Algorithm 2 is defined in terms of a function $\Delta_m(\mathcal{L}, Q, h^{(y)}, h^{(-y)}, \delta)$, representing a threshold for a type of hypothesis test. This threshold must be set carefully, since the set $\mathcal{L} \cup Q$ is not actually an i.i.d. sample from $\mathcal{D}_{XY}$. Dasgupta, Hsu, and Monteleoni [2007] suggest defining this function as

\[ \Delta_m(\mathcal{L}, Q, h^{(y)}, h^{(-y)}, \delta) = \beta_m^2 + \beta_m \left( \sqrt{\mathrm{er}_{\mathcal{L} \cup Q}(h^{(y)})} + \sqrt{\mathrm{er}_{\mathcal{L} \cup Q}(h^{(-y)})} \right), \qquad (2.3) \]

where $\beta_m = \sqrt{\frac{4 \ln\left(8m(m+1)\, C[2m]^2 / \delta\right)}{m}}$ and $C[2m]$ is the shatter coefficient [e.g., Devroye et al., 1996]; this suggestion is based on a confidence bound they derive, and they prove the correctness of the algorithm with this definition. For now we will focus on the first return value (the classifier), leaving the others for Section 2.4 where they will be useful for chaining multiple executions together.
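To get a feel for the size of the threshold in (2.3), it can be evaluated directly. This is a sketch under assumptions: the shatter coefficient C[2m] is supplied by the caller (for thresholds on the real line, k points admit at most k + 1 labelings, so C[2m] = 2m + 1 is used in the usage note), and the function names are illustrative.

```python
import math


def beta(m: int, delta: float, shatter_2m: float) -> float:
    """beta_m = sqrt(4 ln(8 m (m + 1) C[2m]^2 / delta) / m)."""
    return math.sqrt(
        4.0 * math.log(8.0 * m * (m + 1) * shatter_2m ** 2 / delta) / m
    )


def threshold(m: int, delta: float, shatter_2m: float,
              err_y: float, err_neg_y: float) -> float:
    """Delta_m from (2.3), given the empirical error rates of h^(y) and h^(-y) on L ∪ Q."""
    b = beta(m, delta, shatter_2m)
    return b * b + b * (math.sqrt(err_y) + math.sqrt(err_neg_y))
```

For example, `threshold(100, 0.05, 201, 0.0, 0.0)` evaluates the threshold for threshold classifiers after 100 unlabeled examples when both candidate classifiers have zero empirical error. The value shrinks as m grows, so labels are inferred rather than requested more and more often for points far from the decision boundary.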


2.3  Convergence  Rates 


In both of the above cases, one can prove fallback guarantees stating that neither algorithm is significantly worse than the minimax rates for passive learning [Balcan, Beygelzimer, and Langford, 2006, Dasgupta, Hsu, and Monteleoni, 2007]. However, it is even more interesting to discuss situations in which one can prove error rate guarantees for these algorithms significantly better than those achievable by passive learning. In this section, we begin by reviewing known results on these potential improvements, stated in terms of the disagreement coefficient; we then proceed to discuss new results for Algorithm 1 and a novel variant of Algorithm 2, and describe the convergence rates achieved by these methods in terms of the disagreement coefficient and Tsybakov’s noise conditions.


2.3.1  The  Disagreement  Coefficient  and  Active  Learning:  Basic  Results 

Before going into the results for general distributions $\mathcal{D}_{XY}$ on $\mathcal{X} \times \{-1, +1\}$, it will be instructive to first look at the special case when the noise rate is zero. Understanding how the disagreement coefficient enters into the analysis of this simpler case may aid in digestion of the theorems and proofs for the general case presented later, where it plays an essentially analogous role. Most of the major ingredients of the proofs for the general case can be found in this special case, albeit in a much simpler form. Although this result has not previously been published, the proof is essentially similar to (one case of) the analysis of Algorithm 1 in [Hanneke, 2007b].

Theorem 2.7. Suppose $\mathcal{D}_{XY} \in \mathrm{Realizable}(C)$ for a VC class $C$, and let $f \in C$ be such that $\mathrm{er}(f) = 0$, and $\theta_f < \infty$. For any $n \in \mathbb{N}$, with probability $\ge 1 - \delta$ over the draw of the unlabeled examples, the classifier $h_n$ returned by Algorithm 0 after $n$ label requests satisfies

\[ \mathrm{er}(h_n) \le 2 \cdot \exp\left\{ - \frac{n}{6\theta_f \left( 4d \ln(44\theta_f) + \ln(2n/\delta) \right)} \right\}. \]

Proof. The case $\mathrm{diam}(C) = 0$ is trivial, so assume $\mathrm{diam}(C) > 0$ (and thus $d \ge 1$ and $\theta_f > 0$). Let $V_t$ denote the set of classifiers in $C$ consistent with the first $t$ label requests. If $\mathbb{P}(\mathrm{DIS}(V_t)) = 0$ for some $t \le n$, then the result holds trivially. Otherwise, with probability 1, the algorithm uses all $n$ label requests; in this case, consider some $t < n$. Let $x_{m_t}$ denote the example corresponding to the $t$th label request. Let $\lambda_n = 4\theta_f \left( 4d \ln(16e\theta_f) + \ln(2n/\delta) \right)$, $t' = t + \lambda_n$, and let $x_{m_{t'}}$ denote the example corresponding to label request number $t'$ (assuming $t \le n - \lambda_n$). In particular, this implies $|\{x_{m_t + 1}, x_{m_t + 2}, \ldots, x_{m_{t'}}\} \cap \mathrm{DIS}(V_t)| \ge \lambda_n$, which means there is an i.i.d. sample of size $\lambda_n$ from $\mathcal{D}_{XY}[X]$ given $X \in \mathrm{DIS}(V_t)$ contained in $\{x_{m_t + 1}, x_{m_t + 2}, \ldots, x_{m_{t'}}\}$: namely, the first $\lambda_n$ points in this subsequence that are in $\mathrm{DIS}(V_t)$.

Now recall that, by classic results from the passive learning literature [e.g., Blumer et al., 1989, Vapnik, 1982], this implies that on an event $E_{\delta,t}$ holding with probability $1 - \delta/n$,

\[ \sup_{h \in V_{t'}} \mathrm{er}(h \mid \mathrm{DIS}(V_t)) \le \frac{4d \ln\frac{2e\lambda_n}{d} + \ln\frac{2n}{\delta}}{\lambda_n} \le \frac{1}{2\theta_f}. \]

Since $V_{t'} \subseteq V_t$, this means

\[ \mathbb{P}(\mathrm{DIS}(V_{t'})) \le \mathbb{P}\left(\mathrm{DIS}\left(B\left(f, \mathbb{P}(\mathrm{DIS}(V_t))/(2\theta_f)\right)\right)\right) \le \mathbb{P}(\mathrm{DIS}(V_t))/2. \]

By a union bound, the events $E_{\delta,t}$ hold for all $t \in \{i\lambda_n : i \in \{0, 1, \ldots, \lfloor n/\lambda_n \rfloor - 1\}\}$ with probability $\ge 1 - \delta$. On these events, if $n \ge \lambda_n \lceil \log_2(1/\epsilon) \rceil$, then (by induction)

\[ \sup_{h \in V_n} \mathrm{er}(h) \le \mathbb{P}(\mathrm{DIS}(V_n)) \le \epsilon. \]

Solving for $\epsilon$ in terms of $n$ gives the result. □
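The halving argument in this proof can be watched in a small simulation. The sketch below assumes that Algorithm 0 (defined outside this section) is the disagreement-based procedure that requests a label only for points falling in the region of disagreement of the current version space; it specializes that procedure to thresholds on [0, 1] with a uniform marginal and no noise. The function name, the seed, and the cap on unlabeled examples are illustrative assumptions.

```python
import random


def run_algorithm0_thresholds(z_star: float, n_labels: int,
                              seed: int = 0, max_unlabeled: int = 200_000):
    """Disagreement-based active learning of a threshold h_z(x) = +1 iff x >= z."""
    rng = random.Random(seed)
    lo, hi = 0.0, 1.0          # version space: thresholds z with lo < z <= hi
    labels_used = 0
    unlabeled_seen = 0
    while labels_used < n_labels and unlabeled_seen < max_unlabeled:
        x = rng.random()       # next unlabeled example, uniform on [0, 1)
        unlabeled_seen += 1
        if not (lo < x < hi):
            continue           # every consistent threshold agrees on x: no request
        y = +1 if x >= z_star else -1   # label request in the realizable case
        labels_used += 1
        if y == +1:
            hi = x             # consistent thresholds must satisfy z <= x
        else:
            lo = x             # consistent thresholds must satisfy z > x
    return lo, hi, labels_used, unlabeled_seen
```

After the run, hi - lo is the probability mass of the region of disagreement, and it upper-bounds the error of every threshold still in the version space, mirroring the final display of the proof. The number of unlabeled examples consumed grows quickly as the region shrinks, which is why the sketch caps it.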




2.3.2  Known  Results  on  Convergence  Rates  for  Agnostic  Active  Learning 


We will now describe the known results for agnostic active learning algorithms, starting with Algorithm 1. The key to the potential convergence rate improvements of Algorithm 1 is that, as the region of disagreement $R$ decreases in measure, the magnitude of the error difference $\mathrm{er}(h|R) - \mathrm{er}(h'|R)$ of any classifiers $h, h' \in V$ under the conditional sampling distribution (given $R$) can become significantly larger (by a factor of $\mathbb{P}(R)^{-1}$) than $\mathrm{er}(h) - \mathrm{er}(h')$, making it significantly easier to determine which of the two is worse using a sample of labeled examples. In particular, [Hanneke, 2007b] developed a technique for analyzing this type of algorithm, resulting in the following convergence rate guarantee for Algorithm 1. The proof follows similar reasoning to what we will see in the next subsection, but is omitted here to reduce redundancy; see [Hanneke, 2007b] for the full details.

Theorem 2.8. [Hanneke, 2007b] Let $\hat{h}_n$ be the classifier returned by Algorithm 1 when allowed $n$ label requests, using the bounds (2.1) and confidence parameter $\delta > 0$. Then there exists a finite universal constant $c$ such that, with probability $\ge 1 - \delta$, $\forall n \in \mathbb{N}$,

\[ \mathrm{er}(\hat{h}_n) - \nu \le c \sqrt{ \frac{\nu^2 \theta^2 d \log\frac{1}{\delta}}{n} \log\frac{n}{\nu^2 \theta^2 d \log\frac{1}{\delta}} } + \frac{1}{\delta} \exp\left\{ - \sqrt{\frac{n}{c\, \theta^2 d}} \right\}. \]


Similarly, the key to improvements from Algorithm 2 is that as $m$ increases, we only need to request the labels of those examples in the region of disagreement of the set of classifiers with near-optimal empirical error rates. Thus, if $\mathbb{P}(\mathrm{DIS}(C(\epsilon)))$ shrinks as $\epsilon$ decreases, we expect the frequency of label requests to shrink as $m$ increases. Since we are careful not to discard the best classifier, and the excess error rate of a classifier can be bounded in terms of the $\Delta_m$ function, we end up with a bound on the excess error which is converging in $m$, the number of unlabeled examples processed, even though we request a number of labels growing slower than $m$. When this situation occurs, we expect Algorithm 2 will provide an improved convergence rate compared to passive learning. Using the disagreement coefficient, Dasgupta, Hsu, and Monteleoni [2007] prove the following convergence rate guarantee.

Theorem 2.9. [Dasgupta, Hsu, and Monteleoni, 2007] Let $\hat{h}_n$ be the classifier returned by Algorithm 2 when allowed $n$ label requests, using the threshold (2.3), and confidence parameter $\delta > 0$. Then there exists a finite universal constant $c$ such that, with probability $\ge 1 - \delta$, $\forall n \in \mathbb{N}$,

\[ \mathrm{er}(\hat{h}_n) - \nu \le c \sqrt{ \frac{\nu^2 \theta d \log\frac{1}{\delta}}{n} \log\frac{n}{\nu^2 \theta d \log\frac{1}{\delta}} } + \sqrt{d \log\frac{1}{\delta}} \cdot \exp\left\{ - \sqrt{\frac{n}{c\, \theta d^2 \log^2\frac{1}{\delta}}} \right\}. \]


Note that, among other changes, this bound improves the dependence on the disagreement coefficient, $\theta$, compared to the bound for Algorithm 1. In both cases, for certain ranges of $\theta$, $\nu$, and $n$, these bounds can represent significant improvements in the excess error guarantees, compared to the corresponding guarantees possible for passive learning. However, in both cases, when $\nu > 0$ these bounds have an asymptotic dependence on $n$ of $\Theta(n^{-1/2})$, which is no better than the convergence rates achievable by passive learning (e.g., by empirical risk minimization). Thus, there remains the question of whether either algorithm can achieve asymptotic convergence rates strictly superior to passive learning for distributions with nonzero noise rates. This is the topic we turn to next.


2.3.3  Adaptation  to  Tsybakov’s  Noise  Conditions 

It is known that for most nontrivial $C$, for any $n$ and $\nu > 0$, for every active learning algorithm there is some distribution with noise rate $\nu$ for which we can guarantee excess error no better than $\propto \nu n^{-1/2}$ [Kaariainen, 2006]; that is, the $n^{-1/2}$ asymptotic dependence on $n$ in the above bounds matches the corresponding minimax rate, and thus cannot be improved as long as the bounds depend on $\mathcal{D}_{XY}$ only via $\nu$ (and $\theta$). Therefore, if we hope to discover situations in which these algorithms have strictly superior asymptotic dependence on $n$, we will need to allow the bounds to depend on a more detailed description of the noise distribution than simply the noise rate $\nu$.

As previously mentioned, one way to describe a noise distribution using a more detailed parameterization is to use Tsybakov’s noise conditions ($\mathrm{Tsybakov}(C, \kappa, \mu)$). In the context of passive learning, this allows one to describe situations in which the rate of convergence is between $n^{-1}$ and $n^{-1/2}$, even when $\nu > 0$. This raises the natural question of how these active learning algorithms perform when the noise distribution satisfies this condition with finite $\mu$ and $\kappa$ parameter values. In many ways, it seems active learning is particularly well-suited to exploit these more favorable noise conditions, since they imply that as we eliminate suboptimal classifiers, the diameter of the version space decreases; thus, for small $\theta$ values, the region of disagreement should also be decreasing, allowing us to focus the samples in a smaller region and accelerate the convergence.


Focusing on the special case of one-dimensional threshold classifiers under a uniform marginal distribution, Castro and Nowak [2006] studied conditions related to $\mathrm{Tsybakov}(C, \kappa, \mu)$. In particular, they studied a threshold-learning algorithm that, unlike the algorithms described here, takes $\kappa$ as input, and found its convergence rate to be $\propto \left( \frac{\log n}{n} \right)^{\frac{\kappa}{2\kappa - 2}}$ when $\kappa > 1$, and $\exp\{-cn\}$ for some ($\mu$-dependent) constant $c$, when $\kappa = 1$. Note that this improves over the $n^{-\frac{\kappa}{2\kappa - 1}}$ rates achievable in passive learning [Tsybakov, 2004]. Furthermore, they prove that a value $\propto n^{-\frac{\kappa}{2\kappa - 2}}$ (or $\exp\{-c'n\}$, for some $c'$, when $\kappa = 1$) is also a lower bound on the minimax rate. Later, in a personal communication, Langford reported that this near-optimal rate is also achieved by Algorithm 1 for the same learning problem (one-dimensional threshold classifiers under a uniform marginal distribution), leading to speculation that perhaps these improvements are achievable in the general case as well (under conditions on the disagreement coefficient).
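The gap between the two regimes is easiest to see by comparing the exponents of n directly: κ/(2κ - 1) for passive learning and κ/(2κ - 2) for the active threshold learner, with κ = 1 giving an exponential rate in the active case. The helper names below are illustrative, and log factors and constants are ignored.

```python
import math


def passive_exponent(kappa: float) -> float:
    """Exponent a in the passive rate n^(-a), a = kappa / (2 kappa - 1)."""
    return kappa / (2.0 * kappa - 1.0)


def active_exponent(kappa: float) -> float:
    """Exponent a in the active rate n^(-a), a = kappa / (2 kappa - 2).

    Returns infinity for kappa = 1, where the rate is exponential in n.
    """
    if kappa == 1:
        return math.inf
    return kappa / (2.0 * kappa - 2.0)
```

For κ = 2 the passive exponent is 2/3 while the active exponent is 1, so the active learner attains an n^{-1} rate where passive learning attains n^{-2/3}. Both exponents tend to 1/2 as κ grows, so the advantage is largest under the most favorable noise conditions.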


Other than the one-dimensional threshold learning problem, it was not previously known whether Algorithm 1 or Algorithm 2 generally achieves convergence rates that exhibit these types of improvements.


2.3.4  Adaptive  Rates  in  Active  Learning


The above observations open the question of whether these algorithms, or variants thereof, improve this asymptotic dependence on $n$. It turns out this is indeed possible. Specifically, we have the following result for Algorithm 1.

Theorem 2.10. Let $\hat{h}_n$ be the classifier returned by Algorithm 1 when allowed n label
requests, using the bounds UB and LB and confidence parameter $\delta > 0$. Suppose further that
$\mathcal{D}_{XY} \in \mathrm{Tsybakov}(\mathbb{C}, \kappa, \mu)$ for finite parameter values
$\kappa \ge 1$ and $\mu > 0$ and VC class $\mathbb{C}$. Then there exists a finite ($\kappa$- and
$\mu$-dependent) constant c such that, for any $n \in \mathbb{N}$, with probability
$\ge 1 - \delta$,

$$\mathrm{er}(\hat{h}_n) - \nu \le c \cdot \begin{cases} \exp\left\{-\frac{n}{c\, d\, \theta^2 \log(n/\delta)}\right\}, & \text{when } \kappa = 1 \\ \left(\frac{d\, \theta^2 \log^2(n/\delta)}{n}\right)^{\frac{\kappa}{2\kappa-2}}, & \text{when } \kappa > 1. \end{cases}$$

Proof. The case of $\mathrm{diam}(\mathbb{C}) = 0$ clearly holds, so we will focus on the
nontrivial case of $\mathrm{diam}(\mathbb{C}) > 0$ (and therefore, $\theta > 0$ and $d \ge 1$).
We will proceed by bounding the label complexity, or size of the label budget n that is sufficient
to guarantee, with high probability, that the excess error of the returned classifier will be at
most $\epsilon$ (for arbitrary $\epsilon > 0$); with this in hand, we can simply bound the inverse
of the function to get the result in terms of a bound on excess error.

First note that, by Lemma 2.6 and a union bound, on an event of probability $1 - \delta$, (2.2)
holds with $\eta = \delta/n$ for every set Q, relative to the conditional distribution given its
respective R set, for any value of n. For the remainder of this proof, we assume that this
$1 - \delta$ probability event occurs. In particular, this means that for every
$h \in \mathbb{C}$ and every Q set in the algorithm,
$LB(h, Q, \delta/n) \le \mathrm{er}(h|R) \le UB(h, Q, \delta/n)$, for the set R that Q is sampled
under. Thus, we always have the invariant that at all times,

$$\forall \gamma > 0, \ \{h \in V : \mathrm{er}(h) - \nu \le \gamma\} \neq \emptyset, \tag{2.4}$$

and therefore also that $\forall t$,
$\mathrm{er}(h_t) - \nu = \left(\mathrm{er}(h_t|R) - \inf_{h \in V} \mathrm{er}(h|R)\right)\mathbb{P}(R) \le \beta_t$.
We will spend the remainder of the proof bounding the size of n sufficient to guarantee some
$\beta_t \le \epsilon$.

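Recalling the definition of the $h^{(k)}$ sequence (from Definition 2.1), note that after step 7,

\begin{align*}
&\left\{h \in V : \limsup_k \mathbb{P}(h(X) \neq h^{(k)}(X)) > \frac{\mathbb{P}(R)}{2\theta}\right\} \\
&\quad = \left\{h \in V : \left(\frac{\limsup_k \mathbb{P}(h(X) \neq h^{(k)}(X))}{\mu}\right)^{\kappa} > \left(\frac{\mathbb{P}(R)}{2\mu\theta}\right)^{\kappa}\right\} \\
&\quad \subseteq \left\{h \in V : \left(\frac{\mathrm{diam}(\mathrm{er}(h) - \nu; \mathbb{C})}{\mu}\right)^{\kappa} > \left(\frac{\mathbb{P}(R)}{2\mu\theta}\right)^{\kappa}\right\} \\
&\quad \subseteq \left\{h \in V : \mathrm{er}(h) - \nu > \left(\frac{\mathbb{P}(R)}{2\mu\theta}\right)^{\kappa}\right\} \\
&\quad = \left\{h \in V : \mathrm{er}(h|R) - \inf_{h' \in V} \mathrm{er}(h'|R) > \mathbb{P}(R)^{\kappa-1}(2\mu\theta)^{-\kappa}\right\} \\
&\quad \subseteq \left\{h \in V : UB(h, Q, \delta/n) - \min_{h' \in V} LB(h', Q, \delta/n) > \mathbb{P}(R)^{\kappa-1}(2\mu\theta)^{-\kappa}\right\} \\
&\quad = \left\{h \in V : LB(h, Q, \delta/n) - \min_{h' \in V} UB(h', Q, \delta/n) > \mathbb{P}(R)^{\kappa-1}(2\mu\theta)^{-\kappa} - 4G(|Q|, \delta/n)\right\}.
\end{align*}

By definition, every $h \in V$ has $LB(h, Q, \delta/n) \le \min_{h' \in V} UB(h', Q, \delta/n)$, so
for this last set to be nonempty after step 7, we must have
$\mathbb{P}(R)^{\kappa-1}(2\mu\theta)^{-\kappa} < 4G(|Q|, \delta/n)$. On the other hand, if
$\left\{h \in V : \limsup_k \mathbb{P}(h(X) \neq h^{(k)}(X)) > \frac{\mathbb{P}(R)}{2\theta}\right\} = \emptyset$,
then

\begin{align*}
\mathbb{P}(DIS(V)) &\le \mathbb{P}\left(DIS\left(\left\{h \in \mathbb{C} : \limsup_k \mathbb{P}(h(X) \neq h^{(k)}(X)) \le \mathbb{P}(R)/(2\theta)\right\}\right)\right) \\
&= \limsup_k \mathbb{P}\left(DIS\left(\left\{h \in \mathbb{C} : \mathbb{P}(h(X) \neq h^{(k)}(X)) \le \mathbb{P}(R)/(2\theta)\right\}\right)\right) \le \limsup_k \theta \frac{\mathbb{P}(R)}{2\theta} = \frac{\mathbb{P}(R)}{2},
\end{align*}

so that we will definitely satisfy the condition in step 2 on the next round. Since $|Q|$ gets reset
to 0 upon reaching step 3, we have that after every execution of step 7,
$\mathbb{P}(R)^{\kappa-1}(2\mu\theta)^{-\kappa} < 4G(|Q| - 1, \delta/n)$.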

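If $\mathbb{P}(R) \le \frac{\epsilon}{2G(|Q|-1, \delta/n)} \le \frac{\epsilon}{2G(|Q|, \delta/n)}$,
then certainly $\beta_t \le \epsilon$. So on any round for which $\beta_t > \epsilon$, we must have
$\mathbb{P}(R) > \frac{\epsilon}{2G(|Q|-1, \delta/n)}$. Combined with the above observations, on any
round for which $\beta_t > \epsilon$,
$\left(\frac{\epsilon}{2G(|Q|-1, \delta/n)}\right)^{\kappa-1}(2\mu\theta)^{-\kappa} < 4G(|Q|-1, \delta/n)$,
which implies (by simple algebra)

$$|Q| \le \left(\frac{1}{\epsilon}\right)^{\frac{2\kappa-2}{\kappa}} (6\mu\theta)^2 \left(\ln\frac{4}{\delta} + (d+1)\ln(n)\right) + 1.$$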
Since we need to reach step 3 at most $\lceil \log(1/\epsilon) \rceil$ times before we are
guaranteed some $\beta_t \le \epsilon$ ($\mathbb{P}(R)$ is at least halved each time we reach
step 3), any

2H+ 

suffices to guarantee some $\beta_t \le \epsilon$. This implies the stated result by basic
inequalities to bound the smallest value of $\epsilon$ satisfying (2.5) for a given value of n. □


(d  +  1)  ln(ra)  +1  log2 


(2.5) 


If the disagreement coefficient is relatively small, Theorem 2.10 can represent a significant
improvement in convergence rate compared to passive learning, where we typically expect rates
of order $n^{-\kappa/(2\kappa-1)}$ [Mammen and Tsybakov, 1999, Tsybakov, 2004]; this gap is
especially notable when the disagreement coefficient and $\kappa$ are small. In particular, the
bound matches (up to log factors) the form of the minimax rate lower bound proven by Castro and
Nowak [2006] for threshold classifiers (where $\theta = 2$). Note that, unlike the analysis of
Castro and Nowak [2006], we do not require the algorithm to be given any extra information about
the noise distribution, so that this result is somewhat stronger; it is also more general, as this
bound applies to an arbitrary hypothesis class. In some sense, Theorem 2.10 is somewhat surprising,
since the bounds UB and LB used to define the set V and the bounds $\beta_t$ are not themselves
adaptive to the noise conditions.
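As a rough numerical illustration of this comparison, the sketch below evaluates the shape of the Theorem 2.10 bound, assumed here to be $c\,\exp\{-n/(c\,d\,\theta^2\log(n/\delta))\}$ for $\kappa = 1$ and $c\,(d\,\theta^2\log^2(n/\delta)/n)^{\kappa/(2\kappa-2)}$ for $\kappa > 1$, against the passive rate $n^{-\kappa/(2\kappa-1)}$. The constant c is unspecified in the theorem; setting it to 1 and using natural logarithms are arbitrary assumptions made only for this example, and the function names are hypothetical.

```python
import math


def a2_bound_shape(n: int, d: int, theta: float, delta: float,
                   kappa: float, c: float = 1.0) -> float:
    """Shape of the active-learning bound; c = 1 is an arbitrary placeholder."""
    log_term = math.log(n / delta)  # natural log, an assumption of this sketch
    if kappa == 1:
        return c * math.exp(-n / (c * d * theta ** 2 * log_term))
    base = d * theta ** 2 * log_term ** 2 / n
    return c * base ** (kappa / (2 * kappa - 2))


def passive_rate_shape(n: int, kappa: float) -> float:
    """Shape n^(-kappa / (2 kappa - 1)) of the typical passive rate."""
    return n ** (-kappa / (2 * kappa - 1))


if __name__ == "__main__":
    for n in (10 ** 6, 10 ** 9, 10 ** 12):
        active = a2_bound_shape(n, d=1, theta=2.0, delta=0.05, kappa=2.0)
        passive = passive_rate_shape(n, kappa=2.0)
        print(f"n={n:.0e}: active shape {active:.3e}, passive shape {passive:.3e}")
```

With these placeholder constants the active expression only drops below the passive one for fairly large n, which is consistent with the improvement being an asymptotic statement; the $\theta^2$ factor also shows directly how a large disagreement coefficient erodes the advantage.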

Note that, as before, n gets divided by $\theta^2$ in the rates achieved by $A^2$. As before, it
is not clear whether any modification to the definitions of UB and LB can reduce this exponent on
$\theta$ from 2 to 1. As such, it is natural to investigate the rates achieved by Algorithm 2 under
$\mathrm{Tsybakov}(\mathbb{C}, \kappa, \mu)$; we know that it does improve the dependence on
$\theta$ for the worst case rates over distributions with any given noise rate, so we might hope
that it does the same for the rates over distributions with any given values of $\mu$ and
$\kappa$. Unfortunately, we do not presently know whether the original definition of Algorithm 2
achieves this improvement. However, we now present a slight modification of the algorithm, and
prove that it does indeed provide the desired improvement in dependence on $\theta$, while
maintaining the improvements in the asymptotic dependence on n. Specifically, consider the
following definition for the threshold in Algorithm 2.

$$\Delta_m(\mathcal{L}, Q, h^{(y)}, h^{(-y)}, \delta) = 3\hat{E}_{\mathbb{C}}(\mathcal{L} \cup Q, \delta; \mathcal{L}), \tag{2.6}$$

where $\hat{E}_{\mathbb{C}}(\cdot, \cdot; \cdot)$ is defined in Section 2.6 based on a notion of
local Rademacher complexity studied by Koltchinskii [2006]. Unlike the previous definitions, these
definitions are known to be adaptive to Tsybakov's noise conditions, so that we would expect them
to be asymptotically tighter and therefore allow the algorithm to more aggressively prune the set
of candidate hypotheses. Using these definitions, we have the following theorem; its proof is
included in Section 2.7.


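Theorem 2.11. Suppose $\hat{h}_n$ is the classifier returned by Algorithm 2 with threshold as in
(2.6), when allowed n label requests and given confidence parameter $\delta > 0$. Suppose further
that $\mathcal{D}_{XY} \in \mathrm{Tsybakov}(\mathbb{C}, \kappa, \mu)$ for finite parameter values
$\kappa \ge 1$ and $\mu > 0$ and VC class $\mathbb{C}$. Then there exists a finite ($\kappa$- and
$\mu$-dependent) constant c such that, with probability $\ge 1 - \delta$,
$\forall n \in \mathbb{N}$,

$$\mathrm{er}(\hat{h}_n) - \nu \le c \cdot \begin{cases} \exp\left\{-\sqrt{\frac{n}{c\, d\, \theta \log^3(d/\delta)}}\right\}, & \text{when } \kappa = 1 \\ \left(\frac{d\, \theta \log^2(dn/\delta)}{n}\right)^{\frac{\kappa}{2\kappa-2}}, & \text{when } \kappa > 1. \end{cases}$$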


Note that this does indeed improve the dependence on $\theta$, reducing its exponent from 2 to 1;
we do lose some in that there is now a square root in the exponent of the $\kappa = 1$ case, but
it is likely that an improved definition of $\hat{E}$ and a refined analysis can correct this. The
bound in Theorem 2.11 is stated in terms of the VC dimension d. However, for certain nonparametric
function classes, it is sometimes preferable to quantify the complexity of the class in terms of a
constraint on the entropy (with bracketing) of the class,
$\mathrm{Entropy}_{[]}(\mathbb{C}, \alpha, \rho)$ [see e.g., van der Vaart and Wellner, 1996,
Tsybakov, 2004, Koltchinskii, 2006, Castro and Nowak, 2007].


In passive learning, it is known that empirical risk minimization achieves a rate of order
$n^{-\kappa/(2\kappa+\rho-1)}$ under
$\mathrm{Entropy}_{[]}(\mathbb{C}, \alpha, \rho) \cap \mathrm{Tsybakov}(\mathbb{C}, \kappa, \mu)$,
and that this is sometimes tight [Koltchinskii, 2006, Tsybakov, 2004]. The following theorem gives
a bound on the rate of convergence of the same version of Algorithm 2 as in Theorem 2.11, this time
in terms of the entropy with bracketing condition which, as before, is faster than the passive
learning rate when the disagreement coefficient is small. The proof of this is included in
Section 2.7.

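Theorem 2.12. Suppose $\hat{h}_n$ is the classifier returned by Algorithm 2 with threshold as in
(2.6), when allowed n label requests and given confidence parameter $\delta > 0$. Suppose further
that
$\mathcal{D}_{XY} \in \mathrm{Entropy}_{[]}(\mathbb{C}, \alpha, \rho) \cap \mathrm{Tsybakov}(\mathbb{C}, \kappa, \mu)$
for finite parameter values $\kappa \ge 1$, $\mu > 0$, $\alpha > 0$, and $\rho \in (0, 1)$. Then
there exists a finite ($\kappa$, $\mu$, $\alpha$ and $\rho$-dependent) constant c such that, with
probability $\ge 1 - \delta$, $\forall n \in \mathbb{N}$,

$$\mathrm{er}(\hat{h}_n) - \nu \le c \left(\frac{\theta \log(n/\delta)}{n}\right)^{\frac{\kappa}{2\kappa+\rho-2}}.$$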


Although this result is stated for Algorithm 2, it is conceivable that, by modifying Algorithm
1 to use definitions of V and $\beta_t$ based on $\hat{E}_{\mathbb{C}}(Q, \delta; \emptyset)$, an
analogous result may be possible for Algorithm 1 as well.


2.4  Model  Selection 


While  the  previous  sections  address  adaptation  to  the  noise  distribution,  they  are  still  restrictive 
in  that  they  deal  only  with  finite  complexity  hypothesis  classes,  where  it  is  often  unrealistic 
to  expect  convergence  to  the  Bayes  error  rate  to  be  achievable.  We  address  this  issue  in  this 
section  by  developing  a  general  algorithm  for  learning  with  a  sequence  of  nested  hypothesis 
classes of increasing complexity, similar to the setting of Structural Risk Minimization in passive
learning [Vapnik, 1982]. The starting point for this discussion is the assumption of a structure on
$\mathbb{C}$, in the form of a sequence of nested hypothesis classes.

$$\mathbb{C}_1 \subset \mathbb{C}_2 \subset \cdots$$

Each class has an associated noise rate $\nu_i = \inf_{h \in \mathbb{C}_i} \mathrm{er}(h)$, and we
define $\nu_\infty = \lim_{i \to \infty} \nu_i$. We also let $\theta_i$ and $d_i$ be the
disagreement coefficient and VC dimension, respectively, for the set $\mathbb{C}_i$. We are
interested in an algorithm that guarantees convergence in probability of the error rate to
$\nu_\infty$. We are particularly interested in situations where $\nu_\infty = \nu^*$, a condition
which is realistic in this setting since $\mathbb{C}_i$ can be defined so that it is always
satisfied [see e.g., Devroye, Györfi, and Lugosi, 1996]. Additionally, if we are so lucky as to
have some $\nu_i = \nu^*$, then we would like the convergence rate achieved by the algorithm to be
not significantly worse than running one of the above agnostic active learning algorithms with
hypothesis class $\mathbb{C}_i$ alone. In this context, we can define a structure-dependent version
of Tsybakov's noise condition by
$\bigcap_{i \in I} \mathrm{Tsybakov}(\mathbb{C}_i, \kappa_i, \mu_i)$, for some
$I \subseteq \mathbb{N}$, and finite parameters $\kappa_i \ge 1$ and $\mu_i > 0$.

In passive learning, there are several methods for this type of model selection which are
known to preserve the convergence rates of each class $\mathbb{C}_i$ under
$\mathrm{Tsybakov}(\mathbb{C}_i, \kappa_i, \mu_i)$ [e.g., Koltchinskii, 2006, Tsybakov, 2004]. In
particular, Koltchinskii [2006] develops a method that performs this type of model selection; it
turns out we can modify Koltchinskii's method to suit our present needs in the context of active
learning; this results in a general active learning model selection method that preserves the types
of improved rates discussed in the previous section. This modification is presented below, based on
using Algorithm 2 as a subroutine. (It may also be possible to define an analogous method that uses
Algorithm 1 as a subroutine instead.)


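Algorithm 3

Input: nested sequence of classes $\{\mathbb{C}_i\}$, label budget n, confidence parameter $\delta$

Output: classifier $\hat{h}_n$

0. For $i = \lfloor\sqrt{n/2}\rfloor, \lfloor\sqrt{n/2}\rfloor - 1, \lfloor\sqrt{n/2}\rfloor - 2, \ldots, 1$

1.   Let $\mathcal{L}_{in}$ and $Q_{in}$ be the sets returned by Algorithm 2 run with $\mathbb{C}_i$ and the
     threshold in (2.6), allowing $\lfloor n/(2i^2) \rfloor$ label requests, and confidence $\delta/(2i^2)$

2.   Let $h_{in} \leftarrow \mathrm{LEARN}_{\mathbb{C}_i}\left(\bigcup_{j \ge i} \mathcal{L}_{jn}, Q_{in}\right)$

3.   If $h_{in} \neq \emptyset$ and $\forall j$ s.t. $i < j \le \lfloor\sqrt{n/2}\rfloor$,

     $\mathrm{er}_{\mathcal{L}_{jn} \cup Q_{jn}}(h_{in}) - \mathrm{er}_{\mathcal{L}_{jn} \cup Q_{jn}}(h_{jn}) \le \frac{3}{2}\hat{E}_{\mathbb{C}_j}\left(\mathcal{L}_{jn} \cup Q_{jn}, \delta/(2j^2); \mathcal{L}_{jn}\right)$

4.     $\hat{h}_n \leftarrow h_{in}$

5. Return $\hat{h}_n$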
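The budget split in steps 0 and 1 can be checked with a short sketch. The helper name `budget_schedule` is hypothetical; the code only reproduces the bookkeeping (class $i$ receives $\lfloor n/(2i^2) \rfloor$ label requests and confidence $\delta/(2i^2)$) and does not implement Algorithm 2 or the selection test in step 3. Because $\sum_{i \ge 1} 1/(2i^2) = \pi^2/12 < 1$, the budgets sum to at most n and the confidence parameters to at most $\delta$.

```python
import math
from typing import List, Tuple


def budget_schedule(n: int, delta: float) -> List[Tuple[int, int, float]]:
    """Return (i, label_budget_i, confidence_i) in the order the classes are visited.

    The class index i runs from floor(sqrt(n/2)) down to 1; class i is given
    floor(n / (2 i^2)) label requests and confidence parameter delta / (2 i^2).
    """
    if n < 2:
        raise ValueError("need n >= 2 so that at least one class is visited")
    i_max = math.isqrt(n // 2)  # equals floor(sqrt(n / 2)) for integer n
    return [(i, n // (2 * i * i), delta / (2 * i * i)) for i in range(i_max, 0, -1)]


if __name__ == "__main__":
    schedule = budget_schedule(100, 0.05)
    for i, budget, conf in schedule:
        print(f"class {i}: {budget} label requests, confidence {conf:.5f}")
    print("total label requests:", sum(b for _, b, _ in schedule))
```

Stopping at $i = \lfloor\sqrt{n/2}\rfloor$ guarantees that every visited class receives at least one label request.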

The function $\hat{E}_{\cdot}(\cdot, \cdot; \cdot)$ is defined in Section 2.6. This method can be
shown to correctly converge in probability to an error rate of $\nu_\infty$ at a rate never
significantly worse than the original passive learning method of Koltchinskii [2006], as desired.
Additionally, we have the following guarantee on the rate of convergence under the
structure-dependent definition of Tsybakov's noise conditions. The proof is similar in style to
Koltchinskii's original proof, though some care is needed due to the altered sampling distribution
and the constraint set $\mathcal{L}_{jn}$. The proof is included in Section 2.7.

Theorem 2.13. Suppose $\hat{h}_n$ is the classifier returned by Algorithm 3, when allowed n label
requests and confidence parameter $\delta > 0$. Suppose further that
$\mathcal{D}_{XY} \in \bigcap_{i \in I} \mathrm{Tsybakov}(\mathbb{C}_i, \kappa_i, \mu_i)$
for some nonempty $I \subseteq \mathbb{N}$ and for finite parameter values $\kappa_i \ge 1$ and
$\mu_i > 0$. Then there exist finite ($\kappa_i$ and $\mu_i$-dependent) constants $c_i$ such that,
with probability $\ge 1 - \delta$, $\forall n \ge 2$,

$$\mathrm{er}(\hat{h}_n) - \nu_\infty \le 3 \min_{i \in I} (\nu_i - \nu_\infty) + c_i \cdot \begin{cases} \exp\left\{-\sqrt{\frac{n}{c_i d_i \theta_i \log^3(d_i/\delta)}}\right\}, & \text{if } \kappa_i = 1 \\ \left(\frac{d_i \theta_i \log^2(d_i n/\delta)}{n}\right)^{\frac{\kappa_i}{2\kappa_i-2}}, & \text{if } \kappa_i > 1. \end{cases}$$

In particular, if we are so lucky as to have $\nu_i = \nu^*$ for some finite $i \in I$, then the
above algorithm achieves a convergence rate not significantly worse than that guaranteed by
Theorem 2.11 for applying Algorithm 2 directly, with hypothesis class $\mathbb{C}_i$.

As in the case of finite-complexity $\mathbb{C}$, we can also show a variant of this result when
the complexities are quantified in terms of the entropy with bracketing. Specifically, consider the
following theorem; the proof is in Section 2.7. Again, this represents an improvement over
known results for passive learning when the disagreement coefficient is small.


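Theorem 2.14. Suppose $\hat{h}_n$ is the classifier returned by Algorithm 3, when allowed n label
requests and confidence parameter $\delta > 0$. Suppose further that
$\mathcal{D}_{XY} \in \bigcap_{i \in I} \mathrm{Tsybakov}(\mathbb{C}_i, \kappa_i, \mu_i) \cap \mathrm{Entropy}_{[]}(\mathbb{C}_i, \alpha_i, \rho_i)$
for some nonempty $I \subseteq \mathbb{N}$ and finite parameters $\mu_i > 0$, $\kappa_i \ge 1$,
$\alpha_i > 0$ and $\rho_i \in (0, 1)$. Then there exist finite ($\kappa_i$, $\mu_i$, $\alpha_i$
and $\rho_i$-dependent) constants $c_i$ such that, with probability $\ge 1 - \delta$,
$\forall n \ge 2$,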


er(hn)  -  <  3  min(^  -  v^)  +  ct 

i£l 


In addition to these theorems for this structure-dependent version of Tsybakov's noise
conditions, we also have the following result for a structure-independent version.




Theorem 2.15. Suppose $\hat{h}_n$ is the classifier returned by Algorithm 3, when allowed n label
requests and confidence parameter $\delta > 0$. Suppose further that there exists a constant
$\mu > 0$ such that for all measurable $h : \mathcal{X} \to \{-1, 1\}$,
$\mathrm{er}(h) - \nu^* \ge \mu \mathbb{P}\{h(X) \neq h^*(X)\}$. Then there exists a finite
($\mu$-dependent) constant c such that, with probability $\ge 1 - \delta$, $\forall n \ge 2$,

er(hn)  -  v*  <  cmin(z/i  -  v*)  +  exp  <  -  ”  3  id.  >  . 

*  [  y  cdtdi  logd  if  J 

The case where $\mathrm{er}(h) - \nu^* \ge \mu \mathbb{P}\{h(X) \neq h^*(X)\}^{\kappa}$ for
$\kappa > 1$ can be studied analogously, though the rate improvements over passive learning are
more subtle.


2.5  Conclusions 

Under  Tsybakov’s  noise  conditions,  active  learning  can  offer  improved  asymptotic  convergence 
rates  compared  to  passive  learning  when  the  disagreement  coefficient  is  small.  It  is  also  possible 
to  preserve  these  improved  convergence  rates  when  learning  with  a  nested  structure  of  hypothesis 
classes,  using  an  algorithm  that  adapts  to  both  the  noise  conditions  and  the  complexity  of  the 
optimal  classifier. 


2.6  Definition of $\hat{E}$

For any function $f : \mathcal{X} \to \mathbb{R}$, and $\xi_1, \xi_2, \ldots$ a sequence of
independent random variables with distribution uniform in $\{-1, +1\}$, define the Rademacher
process for f under a finite sequence of labeled examples $Q = \{(X'_i, Y'_i)\}$ as

$$R(f; Q) = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \xi_i f(X'_i).$$

The $\xi_i$ should be thought of as internal variables in the learning algorithm, rather than being
fundamental to the learning problem.
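The Rademacher process is simple to compute once the signs are drawn. The sketch below assumes the standard form $R(f; Q) = \frac{1}{|Q|}\sum_{i} \xi_i f(X'_i)$; the helper names are hypothetical, and the signs are generated inside the program, in keeping with the remark that they are internal to the learner.

```python
import random
from typing import Callable, List, Sequence, Tuple


def sample_signs(m: int, rng: random.Random) -> List[int]:
    """Draw m independent signs, each uniform on {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(m)]


def rademacher_process(f: Callable[[float], float],
                       Q: Sequence[Tuple[float, int]],
                       xi: Sequence[int]) -> float:
    """Compute (1/|Q|) * sum_i xi_i * f(X_i); the labels in Q are not used."""
    if len(Q) == 0:
        raise ValueError("Q must be nonempty")
    if len(xi) < len(Q):
        raise ValueError("need one sign per example")
    return sum(s * f(x) for s, (x, _y) in zip(xi, Q)) / len(Q)


if __name__ == "__main__":
    rng = random.Random(0)
    Q = [(0.1, 1), (0.4, -1), (0.7, 1), (0.9, 1)]
    xi = sample_signs(len(Q), rng)
    print(rademacher_process(lambda x: x, Q, xi))
```

Only the points $X'_i$ enter the computation; the labels influence the quantities of this section through the set of hypotheses over which the supremum of $R$ is taken.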
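For any two sequences of labeled examples $\mathcal{L} = \{(X'_i, Y'_i)\}$ and
$Q = \{(X''_i, Y''_i)\}$, define

$$\mathbb{C}[\mathcal{L}] = \{h \in \mathbb{C} : \mathrm{er}_{\mathcal{L}}(h) = 0\},$$

let

$$\hat{h} = \mathop{\mathrm{argmin}}_{h \in \mathbb{C}[\mathcal{L}]} \mathrm{er}_Q(h),$$

and define

$$\mathbb{C}(\epsilon; \mathcal{L}, Q) = \{h \in \mathbb{C}[\mathcal{L}] : \mathrm{er}_Q(h) - \mathrm{er}_Q(\hat{h}) \le \epsilon\},$$

$$\hat{D}_{\mathbb{C}}(\epsilon; \mathcal{L}, Q) = \sup_{h_1, h_2 \in \mathbb{C}(\epsilon; \mathcal{L}, Q)} \frac{1}{|Q|} \sum_{i=1}^{|Q|} \mathbb{1}[h_1(X''_i) \neq h_2(X''_i)],$$

$$\hat{\phi}_{\mathbb{C}}(\epsilon; \mathcal{L}, Q) = \frac{1}{2} \sup_{h_1, h_2 \in \mathbb{C}(\epsilon; \mathcal{L}, Q)} R(h_1 - h_2; Q).$$

Let $\delta \in (0, 1]$, $m \in \mathbb{N}$, and define

$$s_m(\delta) = \ln \frac{20 m^2 \log_2(3m)}{\delta}.$$

Let $\mathbb{Z}_\epsilon = \{j \in \mathbb{Z} : 2^j \ge \epsilon\}$, and for any sequence of labeled
examples $Q = \{(X'_i, Y'_i)\}$, define $Q_m = \{(X'_1, Y'_1), (X'_2, Y'_2), \ldots, (X'_m, Y'_m)\}$.
We use the following notation of Koltchinskii [2006] with only minor modifications. For
$\epsilon \in [0, 1]$, define

$$\hat{U}_{\mathbb{C}}(\epsilon, \delta; \mathcal{L}, Q) = \hat{K}\left(\hat{\phi}_{\mathbb{C}}(\hat{c}\epsilon; \mathcal{L}, Q) + \sqrt{\frac{s_{|Q|}(\delta)\, \hat{D}_{\mathbb{C}}(\hat{c}\epsilon; \mathcal{L}, Q)}{|Q|}} + \frac{s_{|Q|}(\delta)}{|Q|}\right),$$

$$\hat{E}_{\mathbb{C}}(Q, \delta; \mathcal{L}) = \min_{m \le |Q|} \inf\left\{\epsilon > 0 : \forall j \in \mathbb{Z}_\epsilon, \ \hat{U}_{\mathbb{C}}(2^j, \delta; \mathcal{L}, Q_m) \le 2^{j-4}\right\},$$

where, for our purposes, we can take $\hat{K} = 752$, and $\hat{c} = 3/2$, though there seems to be
room for improvement in these constants. We also define
$\hat{E}_{\mathbb{C}}(\emptyset, \delta; \mathcal{L}) = \infty$ by convention.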
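For a finite set of candidate hypotheses, the empirical diameter term can be computed by brute force. The sketch below assumes $\hat{D}$ is the largest fraction of points in Q on which two hypotheses from the given set disagree; the function names and the threshold classifiers are hypothetical examples, and the restriction to $\mathbb{C}(\epsilon; \mathcal{L}, Q)$ is assumed to have been applied already by the caller.

```python
from itertools import combinations
from typing import Callable, List, Sequence

Hypothesis = Callable[[float], int]


def make_threshold(t: float) -> Hypothesis:
    """Threshold classifier: +1 if x >= t, otherwise -1."""
    return lambda x: 1 if x >= t else -1


def empirical_diameter(hypotheses: Sequence[Hypothesis], xs: Sequence[float]) -> float:
    """Largest empirical disagreement rate between any two of the given hypotheses."""
    if len(xs) == 0:
        raise ValueError("need at least one unlabeled point")
    best = 0.0
    for h1, h2 in combinations(hypotheses, 2):
        rate = sum(1 for x in xs if h1(x) != h2(x)) / len(xs)
        best = max(best, rate)
    return best


if __name__ == "__main__":
    candidates: List[Hypothesis] = [make_threshold(t) for t in (0.2, 0.5, 0.8)]
    points = [0.1, 0.3, 0.6, 0.9]
    print(empirical_diameter(candidates, points))
```

As hypotheses are pruned from the candidate set this quantity can only shrink, which is how the data-dependent bound is able to tighten as labels accumulate.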


2.7  Main  Proofs 

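Let $\bar{E}_{\mathbb{C}}(m, \delta) = \hat{E}_{\mathbb{C}}(\mathcal{Z}_m, \delta; \emptyset)$. For
each $m \in \mathbb{N}$, let $h^*_m = \mathop{\mathrm{argmin}}_{h \in \mathbb{C}} \mathrm{er}_m(h)$
be the empirical risk minimizer in $\mathbb{C}$ for the true labels of the first m examples.

For $\epsilon > 0$, define
$\mathbb{C}(\epsilon) = \{h \in \mathbb{C} : \mathrm{er}(h) - \nu \le \epsilon\}$. For
$m \in \mathbb{N}$, let

$$\phi_{\mathbb{C}}(m, \epsilon) = \mathbb{E} \sup_{h_1, h_2 \in \mathbb{C}(\epsilon)} \left|(\mathrm{er}(h_1) - \mathrm{er}_m(h_1)) - (\mathrm{er}(h_2) - \mathrm{er}_m(h_2))\right|,$$

$$\tilde{U}_{\mathbb{C}}(m, \epsilon, \delta) = \tilde{K}\left(\phi_{\mathbb{C}}(m, \tilde{c}\epsilon) + \sqrt{\frac{s_m(\delta)\, \mathrm{diam}(\mathbb{C}(\tilde{c}\epsilon))}{m}} + \frac{s_m(\delta)}{m}\right),$$

$$\tilde{E}_{\mathbb{C}}(m, \delta) = \inf\left\{\epsilon > 0 : \forall j \in \mathbb{Z}_\epsilon, \ \tilde{U}_{\mathbb{C}}(m, 2^j, \delta) \le 2^{j-4}\right\},$$

where, for our purposes, we can take $\tilde{K} = 8272$ and $\tilde{c} = 3$. We also define
$\tilde{E}_{\mathbb{C}}(0, \delta) = \infty$. The following lemma is crucial to all of the proofs
that follow.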


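Lemma 2.16. [Koltchinskii, 2006] There is an event $E_{\mathbb{C},\delta}$ with
$\mathbb{P}(E_{\mathbb{C},\delta}) \ge 1 - \delta/2$ such that, on event $E_{\mathbb{C},\delta}$,
$\forall m \in \mathbb{N}$, $\forall h \in \mathbb{C}$, $\forall \tau \in (0, 1/m)$,
$\forall h' \in \mathbb{C}(\tau)$,

$$\mathrm{er}(h) - \nu \le \max\left\{2(\mathrm{er}_m(h) - \mathrm{er}_m(h') + \tau), \ \bar{E}_{\mathbb{C}}(m, \delta)\right\},$$

$$\mathrm{er}_m(h) - \mathrm{er}_m(h^*_m) \le \frac{3}{2} \max\left\{(\mathrm{er}(h) - \nu), \ \bar{E}_{\mathbb{C}}(m, \delta)\right\},$$

$$\bar{E}_{\mathbb{C}}(m, \delta) \le \tilde{E}_{\mathbb{C}}(m, \delta),$$

and for any $j \in \mathbb{Z}$ with $2^j > \bar{E}_{\mathbb{C}}(m, \delta)$,

$$\sup_{h_1, h_2 \in \mathbb{C}(2^j)} \left|(\mathrm{er}_m(h_1) - \mathrm{er}(h_1)) - (\mathrm{er}_m(h_2) - \mathrm{er}(h_2))\right| \le \hat{U}_{\mathbb{C}}(2^j, \delta; \emptyset, \mathcal{Z}_m).$$

This lemma essentially follows from details of the proof of Koltchinskii's Theorem 1, Lemma 2,
and Theorem 3 [Koltchinskii, 2006].$^1$ We do not provide a proof of Lemma 2.16 here. The
reader is referred to Koltchinskii's paper for the details.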


2.7.1  Definition of $r_0$

If $\theta$ is bounded by a finite constant, the definition of $r_0$ is not too important. However,
in some cases, setting $r_0 = 0$ results in a suboptimal, or even infinite, value of $\theta$,
which is undesirable. In these cases, we would like to set $r_0$ as large as possible while
maintaining the validity of the bounds, and if we do this carefully we should be able to establish
bounds that, even in the worst case when $\theta = 1/r_0$, are never worse than the bounds for some
analogous passive learning method; however, to do this requires $r_0$ to depend on the parameters
of the learning problem: namely, n, $\delta$, $\mathbb{C}$, and $\mathcal{D}_{XY}$.

$^1$Our $\min_{m \le |Q|}$ modification to Koltchinskii's version of
$\hat{E}_{\mathbb{C}}(m, \delta)$ is not a problem, since $\phi_{\mathbb{C}}(m, \epsilon)$ and
$\frac{s_m(\delta)}{m}$ are nonincreasing functions of m.

Generally, depending on the bound we wish to prove, different values of $r_0$ may be appropriate.
For the tightest bound in terms of $\theta$ proven below (namely, Lemma 2.18), the following
definition of $r_0$ gives a good bound. Defining

$$\tilde{m}_{\mathbb{C}}(n, \delta, \mathcal{D}_{XY}) = \min\left\{m \in \mathbb{N} : n \le \log_2\frac{4m^2}{\delta} + 2e \sum_{\ell=0}^{m-1} \mathbb{P}\left(DIS\left(\mathbb{C}(2\tilde{E}_{\mathbb{C}}(\ell, \delta))\right)\right)\right\}, \tag{2.7}$$

we can let $r_0 = r_{\mathbb{C}}(n, \delta, \mathcal{D}_{XY})$, where

$$r_{\mathbb{C}}(n, \delta, \mathcal{D}_{XY}) = \frac{1}{\tilde{m}_{\mathbb{C}}(n, \delta, \mathcal{D}_{XY})} \sum_{\ell=0}^{\tilde{m}_{\mathbb{C}}(n, \delta, \mathcal{D}_{XY}) - 1} \mathrm{diam}\left(\mathbb{C}(2\tilde{E}_{\mathbb{C}}(\ell, \delta))\right). \tag{2.8}$$

We use this definition in all of the proofs below. In particular, with this definition, Lemma 2.18
is never significantly worse than the analogous known result for passive learning (though it can be
significantly better when $\theta \ll 1/r_0$). For the looser bounds (namely, Theorems 2.11 and
2.12), a larger value of $r_0$ would be more appropriate; however, note that this same general
technique can be employed to define a good value for $r_0$ in these looser bounds as well, simply
using upper bounds on (2.8) analogous to how the theorems themselves are derived from Lemma 2.18
below.

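2.7.2  Proofs Relating to Section 2.3

For $\ell \in \mathbb{N} \cup \{0\}$, let $\mathcal{L}^{(\ell)}$ and $Q^{(\ell)}$ denote the sets
$\mathcal{L}$ and Q, respectively, in step 4 of Algorithm 2, when $m - 1 = \ell$; if this never
happens during execution, then define $\mathcal{L}^{(\ell)} = \emptyset$,
$Q^{(\ell)} = \mathcal{Z}_\ell$.

Lemma 2.17. On event $E_{\mathbb{C},\delta}$, $\forall \ell \in \mathbb{N} \cup \{0\}$,

$$\hat{E}_{\mathbb{C}}(Q^{(\ell)} \cup \mathcal{L}^{(\ell)}, \delta; \mathcal{L}^{(\ell)}) = \bar{E}_{\mathbb{C}}(\ell, \delta)$$

and

$$\forall \epsilon \ge \bar{E}_{\mathbb{C}}(\ell, \delta), \quad h^*_\ell \in \mathbb{C}_\ell(\epsilon; \mathcal{L}^{(\ell)}) \subseteq \mathbb{C}_\ell(\epsilon; \emptyset).$$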

Proof of Lemma 2.17. Throughout this proof, we assume the event $E_{\mathbb{C},\delta}$ occurs. We
proceed by induction on $\ell$, with the base case of $\ell = 0$ (which clearly holds). Suppose the
statements are true for all $\ell' < \ell$. The case $\mathcal{L}^{(\ell)} = \emptyset$ is trivial,
so assume $\mathcal{L}^{(\ell)} \neq \emptyset$. For the inductive step, suppose


heC£{Ec(£,S)J). 

Then  for  all  £'  <  i,  we  have 

ere(h)  —  ere(h})  <  Ec(£',5). 

In  particular,  by  Lemma  12.161  this  implies 

er(h)  —  v  <  max  |2 (erg(h)  —  erg(hg)),Ec(£,8) j  <  2£C(T,<5), 

and  thus  for  any  h!  e  C, 

ere(h)  —  er^(h')  <  er^(h)  —  erg'(h};) 

<  ^max{er{h)-v,tc(t',8)}  <3£c(f,<5)  =  3£C(Q(°,  5;  £(r)). 

Thus,  we  must  have  ercw(h)  =  0,  and  therefore  h  e  Q(£c(£,  <5);  £w).  Since  this  is  the  case 
for  all  such  h,  we  must  have  that 

Cf(£c(M);£M)5C,(£c(^);l).  (2.9) 


In  particular,  this  implies  that 

Uc(£c(£,  5),  QW)  >  f7c(£c(A  5),  0,  Z<)  >  ^-tc(£,  5), 

lo 

where  the  last  inequality  follows  from  the  definition  of  &c(£,  8),  (which  is  a  power  of  2).  Thus, 
we  must  have  £c {Q^  U  C^\8\  C (^)  >  £c(-(,  5). 

The  relation  in  dH  also  implies  that 

h*eet£{Ec(£,sy,C^), 

and  therefore 

Ve  >  tc{£,  5),  Q(e;£^)  CQ(e;0), 
which implies


Ve  >  lc(£,  8),  Uc(e,  5;  £w,  Qw)  <  Uc(e,  8 ;  0,  Ze). 


But  this  means  £c (Q^  U  C^\8\  C^)  <  £c(£:  8).  Therefore,  we  must  have  equality.  Thus,  the 
lemma  follows  by  the  principle  of  induction.  □ 


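Lemma 2.18. Suppose for any $n \in \mathbb{N}$, $\hat{h}_n$ is the classifier returned by
Algorithm 2 with threshold as in (2.6), when allowed n label requests and given confidence
parameter $\delta > 0$, and suppose further that $m_n$ is the value of $|Q| + |\mathcal{L}|$ when
Algorithm 2 returns. Then there is an event $H_{\mathbb{C},\delta}$ such that
$\mathbb{P}(H_{\mathbb{C},\delta} \cap E_{\mathbb{C},\delta}) \ge 1 - \delta$, such that on
$H_{\mathbb{C},\delta} \cap E_{\mathbb{C},\delta}$, $\forall n \in \mathbb{N}$,

$$\mathrm{er}(\hat{h}_n) - \nu \le \tilde{E}_{\mathbb{C}}(m_n, \delta),$$

and

$$n \le \min\left\{m_n, \ \log_2\frac{4m_n^2}{\delta} + 4e\theta \sum_{\ell=0}^{m_n-1} \mathrm{diam}\left(\mathbb{C}(2\tilde{E}_{\mathbb{C}}(\ell, \delta))\right)\right\}.$$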


Proof  of  Lemma  12. 181  Once  again,  assume  event  EC)s  occurs.  By  Lemma I2TT61  Vr  >  0, 


er(hn)  -  v  <  max  j 2(ermn(hn )  -  ermn(h*mJ  +  r),  £c(mn,5)| . 

Letting  r  ->  0,  and  noting  that  erc(h*nJ  =  0  (Lemma|2T71)  implies  ermn(hn)  =  errnn(fi*nJ, 
we  have 

er(hn)  -  v  <  £c(,mn,  8)  <  £c(mn,  8), 

where  the  last  inequality  is  also  due  to  Lemma  12.161  Note  that  this  8,c(mn,8)  represents  an 
interesting  data-dependent  bound. 

To  get  the  bound  on  the  number  of  label  requests,  we  proceed  as  follows.  For  any  m  €  N, 
and  nonnegative  integer  £  <  m,  let  f  be  the  indicator  for  the  event  that  Algorithm  2  requests 
the  label  Y(+\  and  let  Nm  =  JfrLo'  £•  Additionally,  let  I't  be  independent  Bernoulli  random 
variables  with 

F[I'  =  1]=f{dIS{C{2£c(£,8)))}. 




Let  N'm  =  J2T=o  I'e-  We  have  that 

P  [{£  =  1}  n  ECtS]  <  P  \{Xe+1  e  DIS{Ce{£c{Qw  u  £w,  <5;  £f);  £<*>))}  D  ECiS 


<  P 


{Xe+1eDIS(Ce(lc(^S)J))}nECts  <P  DIS(C(2Zc(£,5)))  =P[/;  =  1]. 


The  second  inequality  is  due  to  Lemmas  12.171  and  12.161  while  the  third  inequality  is  due  to 
Lemma  12.161  Note  that 


771—1 


771—1 


er]  =  Epw=1i  =  J2r{DIS<-c(-2^v,t> 

i=o  e=o 

Let  us  name  this  last  quantity  qm.  Thus,  by  union  and  Chernoff  bounds, 


P 


3m  6  N  :  Nm  >  max  2 eqm,  qm  +  log. 


4m2 


n  E, 


C,5 


< 


£p 


ttiGN 


Nm  >  max  2  eqmj  qm  +  log2 


4m2 


n  e. 


€,<5 


<£p 

m£N 


N'm  >  max  2 eqm,  qm  +  log2 


4m2 


5  5 

<  >  - 7T  <  -. 

z—''  4m2  2 

meN 


For  any  n,  we  know  n  <  mn  <  2n.  Therefore,  we  have  that  on  an  event  (which  includes  Ec,s) 
occuring  with  probability  >1  —  5,  for  every  n  e  N, 


n  <  max{lVmn ,  log2  mn}  <  max  2 eqmn,qmn  +  log2 


4m2 


< 


4m, 


mn— 1 


log2— ^  +  2e  ^  F{DIS(C(2E<c(£,  5)))}. 


e=o 


In  particular,  this  implies  rhn  =  mc(n,  5,  VXY)  <  mn  (where  mc(n ,  5,  VXY)  is  defined  in  (12.71)). 
We  now  use  the  definition  of  6  with  the  r0  in  d. 

2  rhn— 1 


n  <  log2  ^  +  2e  ^  V{DIS(C(2£c(t,  5)))} 


< 


4m; 


1=0 
rhn- 1 


l°g2  -f-  +  2e#  ^  max{dmm(C(2£c((?,  5))),rc(n,  5,  VXY)} 


1=0 
rhn— 1 


mn- 1 


4^,2  1  „  4m2  x , 

<  log2 — r^1  +  4e6*  diam(C(2H<c(E:  5)))  <  log2 — ^  diam(C(2£c(^,  5))). 

£=0  £=0 


□ 
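Lemma 2.19. On event $H_{\mathbb{C},\delta} \cap E_{\mathbb{C},\delta}$ (where
$H_{\mathbb{C},\delta}$ is from Lemma 2.18), under $\mathrm{Tsybakov}(\mathbb{C}, \kappa, \mu)$,
$\forall n \in \mathbb{N}$,

$$\tilde{E}_{\mathbb{C}}(m_n, \delta) \le c \cdot \begin{cases} \exp\left\{-\sqrt{\frac{n}{c\, d\, \theta \log^3(d/\delta)}}\right\}, & \text{if } \kappa = 1 \\ \left(\frac{d\, \theta \log^2(nd/\delta)}{n}\right)^{\frac{\kappa}{2\kappa-2}}, & \text{if } \kappa > 1 \end{cases}$$

for some finite constant c (depending on $\kappa$ and $\mu$), and under
$\mathrm{Entropy}_{[]}(\mathbb{C}, \alpha, \rho) \cap \mathrm{Tsybakov}(\mathbb{C}, \kappa, \mu)$,
$\forall n \in \mathbb{N}$,

$$\tilde{E}_{\mathbb{C}}(m_n, \delta) \le c \left(\frac{\theta \log(n/\delta)}{n}\right)^{\frac{\kappa}{2\kappa+\rho-2}}$$

for some finite constant c (depending on $\kappa$, $\mu$, $\rho$, and $\alpha$).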


Proof  of  Lemma  12.191  We  begin  with  the  first  case  {7 sybakov (C,  k,  p)  only). 
We  know  that 


uJc(m,  e)  <  K\ 


ed  log  - 


for  some  constant  K  [see  e.g., 


Massart  and  Elodie  Nedelec 


20061 .  Noting  that  0c (m,  e)  < 


LUc(m,  diam( C(e))),  we  have  that 


uc(m  e  5)  <  K  I  K\j  dmm^Ce^dhg  dia^cjce))  +  J  sm(S)diam(C(ce))  |  sm(5) 


m 


m 


m 


<  K'  max 


eV^dlogi  jsjfi )eV*  sm{5) 


m 


m  m 


Taking  any  e  >  K "  lj  j  1 ,  for  some  constant  K"  >  0,  suffices  to  make  this  latter  quantity 

<te-  So  for  some  appropriate  constant  K  (depending  on  p  and  k),  we  must  have  that 

'  d  log  y  0  2k~1 


£c  (m,  5)  <  K 


m 


(2.10) 


Plugging  this  into  the  query  bound,  we  have  that 

n  <  log2  +  2e0  (2  +  j  "  p{2K')$ 


1 

2k-1 


(2.11) 
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If  k  >  1,  (12.1  II)  is  at  most  K"6rrinK  1  d log  7Jjl,  for  some  constant  K"  (depending  on  k  and 
p).  This  implies 

m„  >  Ji (3) 


n 


2/c-l 
2k  —  2 


9d  log  ■ 

for  some  constant  K^\  Plugging  this  into  (12.101)  and  using  Lemma  12. 1 81  completes  the  proof  for 
this  case. 

On  the  other  hand,  if  k  —  1,  (12.1  lh  is  at  most  K"6d  log2  for  some  constant  K"  (depending 
on  k  and  p).  This  implies 

mn  >  Sexp  |^(3)y^|  ’ 

for  some  constant  K^3\  Plugging  this  into  (12.101).  using  Lemma  12.181  and  simplifying  the  ex¬ 
pression  with  a  bit  of  algebra  completes  this  case. 

For  the  bound  in  terms  of  p,  Koltchinskii  [2006]  proves  that 


~  .  I _ K 

Ec(m,  <5)  <  K  max  <  m  2k+p-1  , 


logf^  2k-1  {  <  ,  ( logf  ^ 


m 


m 


(2.12) 


for  some  constant  K'  (depending  on  p,  a,  and  k).  Plugging  this  into  the  query  bound,  we  have 
that 


n  <  log2  +  2e6  I  2  +  /  p{2 K')«  (  ^  J  1  <  K"0m?+p- 1  log 


Wln-l 


,a/l ogf\  2«+p-i 


2k-|-/9 — 2 


mn 


for  some  constant  K"  (depending  on  n,  p,  a,  and  p).  This  implies 


mr 


>  JT(3) 


2k-\~p~  1 
\  2K-\-p — 2 


for  some  constant  K^\  Plugging  this  into  (12.12ft  and  using  Lemma  12 . 1 8 1  complete  s  the  proof  of 
this  case.  □ 


Proofs of Theorem 2.11 and Theorem 2.12. These theorems now follow directly from Lemmas 2.18
and 2.19. □
2.7.3  Proofs  Relating  to  Section  2.4


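Lemma 2.20. For $i \in \mathbb{N}$, let $\delta_i = \delta/(2i^2)$ and
$m_{in} = |\mathcal{L}_{in}| + |Q_{in}|$ (for $i > \sqrt{n/2}$, define
$\mathcal{L}_{in} = Q_{in} = \emptyset$). For each n, let $\hat{i}_n$ denote the smallest index i
satisfying the condition on $h_{in}$ in step 3 of Algorithm 3. Let $\tau_n = 2^{-n}$ and define

$$i^*_n = \min\left\{i \in \mathbb{N} : \forall i' \ge i, \forall j \ge i', \forall h \in \mathbb{C}_{i'}(\tau_n), \ \mathrm{er}_{\mathcal{L}_{jn}}(h) = 0\right\},$$

and

$$j^*_n = \mathop{\mathrm{argmin}}_{j \in \mathbb{N}} \ \nu_j + \tilde{E}_{\mathbb{C}_j}(m_{jn}, \delta_j).$$

Then on the event $\bigcap_{i=1}^{\infty} E_{\mathbb{C}_i, \delta_i}$,

$$\forall n \in \mathbb{N}, \quad \max\{i^*_n, \hat{i}_n\} \le j^*_n.$$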


Proof  of  Lemma  12.201  Continuing  the  notation  from  the  proof  of  Lemmal2.17l  for  i  G  N  U  {0} , 
let  dfl  and  Q1^  denote  the  sets  C  and  0,  respectively,  in  step  4  of  Algorithm  2,  when  m  —  1  = 
i,  when  run  with  class  Q,  label  budget  \n/(2i2)\,  confidence  parameter  Si,  and  threshold  as 
in  flZl;  if  m  —  1  is  never  i  during  execution,  then  define  =  0  and  =  Zt. 

OO 

Assume  the  event  f~]  EcuSi  occurs.  Suppose,  for  the  sake  of  contradiction,  that  3  =  in  <  i*n 

2=1 

for  some  n  G  N.  Then  there  is  some  i  >  i*n  —  1  such  that,  for  some  i  <  m*n,  we  have  some 
h!  G  Ci»_i(rn)  D  {h  G  Q  :  er  w(h)  =  0}  but 

^ in 

ere(h') -min  erfih)  >  erfih’)-  min  erfih)  >  3£Ci(£-n  ^  ^in)  =  3£c 


hec, 


h&Ci’.er  (7i)= 0 


where the last equality is due to Lemma 2.17. Lemma 2.16 implies this will not happen for
$i = i^*_n - 1$, so we can assume $i \geq i^*_n$. We therefore have (by Lemma 2.16) that


3£Ci(^,  Si)  <  er^ti)  -  min  erfih)  <  -  max  rn  +  £Ci(^,  ^ . 

/igCi  Z  t  J 

In  particular,  this  implies  that 

-  -  3  3 

3£c Amn,  Si)  <  3£Ci(£,  Sf)  <  -  (rn  +  -  z'j)  <  -  {rn  +  -  vf) . 


Therefore, 


A  ^1  Tn 

£c j(mjn,  Sj)  +  Uj  <  £cj (,min,  Si)  +  <  -  (rn  +  ^  -  z'*)  +  <  y  + 
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This  would  imply  that  £r  ■  (jnjm  Sj)  <  rn/ 2  <  — (due  to  the  second  return  condition  in  Al- 

TYljn, 

gorithm  2),  which  by  definition  is  not  possible,  so  we  have  a  contradiction.  Therefore,  we  must 
have  that  every  j*  >  i*n.  In  particular,  we  have  that  Vn  G  N,  hj-n  ^  0- 

Now  pick  an  arbitrary  t  G  hi  with  i  >  j  =  j*,  and  let  h!  G  C7(rri).  Then 

erCinUQin  {hjn)  —  CfCinLlQinihin)  —  erm,in{h'jn)  ~~  ermin{hin) 


< 


ermin(hjn)  -  min  ermin(h) 


< 


< 


-  max  | er(hjn)  -  zy;,  ECi(mn,  St)  j  (Lemma|2T6l> 


-  max  | er(hjn)  -  i/j  +  Vj  -  uh  S.Ci(min,  fo) 


2  (er mjn  ( hj n )  )  T  Tn)  T  z/^ 


max 


£c j  (■ mjn ,  Sj )  +  i/j  -  z/i 
£(C;  (jH'im  &i) 


£c j(rnjn,Sj)  +  z/j  -  z/j 


-  max 
2 


£<Ci  <5j) 


2  (^mj 

3  - 

-£c(AnUQin,  £;£*„) 


(since  j  >  i* ) 

(by  definition  of  jt*) 

(by  Lemmal2.17l). 


□ 


OO 

Lemma  2.21.  On  the  event  f")  E^g.,  Vn  G  N, 

i= 1 


er(fe  J  “  ^oo  <  3  min 


T  £ci  ijTtim 


Proof  of  Lemma  12.211  Let  h'n  G  CJr»  (rn)  for  rn  G  (0,  2  n),  n  G  N. 
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er{hn)  -Voo=  er(h~inn )  - 

=  Vj*  ~v0 o  +  er(h-nn )  -  vj* 

2 (ermj*n(hinn)  -  erm  (h!n)  +  rn) 


<  Vj*  —  Uoo  +  max 


<  Uj*  —  v oo  +  max 


2(e%*nUQJ*„(^nn)  e%»7lUQJ.n(^j*n))  +  rn) 

Sj*) 

The first inequality follows from Lemma 2.16. The second inequality is due to Lemma 2.20 (i.e.,
j*  >  i* ).  In  this  last  line,  we  can  let  rn  — *  0,  and  using  the  definition  of  in  show  that  it  is  at  most 


V3n  U°°  maX  \^  \  2  ^  Qjnn^  )  A>)  )  >  ( mj*n J  *  ) 


= 

l/,'*  — 

Jn 

Uoo  +  3£cj.»  8j*) 

!Lemmal2.17l) 

< 

3  min 

i 

(vi  ~Voo  +  £ci(min,fii)') 

(by  definition  of  j*) 

< 

3  min 

i 

(vi  ~  Voo  +  (minjtii)) 

lLemmal2.16l). 

□ 


We are now ready for the proofs of Theorems 2.13 and 2.14.


Proofs of Theorem 2.13 and Theorem 2.14. These theorems now follow directly from Lemmas 2.21
and 2.19. That is, Lemma 2.21 gives a bound in terms of the $\hat{E}$ quantities, holding on
event $\bigcap_{i=1}^{\infty} E_{C_i,\delta_i}$, and Lemma 2.19 bounds these $\hat{E}$ quantities as
desired, on event $\bigcap_{i=1}^{\infty} H_{C_i,\delta_i} \cap E_{C_i,\delta_i}$. Noting that, by the
union bound,
\[ \mathbb{P}\left( \bigcap_{i=1}^{\infty} H_{C_i,\delta_i} \cap E_{C_i,\delta_i} \right) \geq 1 - \sum_{i=1}^{\infty} \delta_i \geq 1 - \delta \]
completes the proof.

□


Define  c  =  c  +  1,  D(e)  =  lim  diam( C,(e)),  and 

j— >oo 


t  t  /  r  \  is  I  /  rS/o  u  .  I  Sm{3i)D(ce)  Sm(5i) 

UCi{m,  e,  Oi)  =  K  \  uCi{rn,  D{ce))  +  (/ - b 


m 


m 
and


£Ci(m,  Si)  =  inf  \  e  >  0  :  Vj  G  Z£,  2J,  S{)  <  2J 


-4 


Lemma  2.22.  For  any  m,i  E  N, 


£ci("i,  5*)  <  max  £Ci(m,  Si),  vt  -  u, 


Proof  of  Lemma  12.221  For  e  > 


Vi  -  Z'o 


TT  (  r  \  ^  |  i  /  -'ll  /  Sm(di)dicirn(Ci(c£))  Sm(Si) 

UCi(m,  e,  Si)  =  K  4>ci{,m,  ce)  +  -\/ - h 


m 


m 


<  K  I  u)Ci(m,  diam(Ci(ce)))  +  \l  +  sm((S,) 


m 


m 


But  diam(Ci(c£ ))  <  l)(ce  +  (z/?:  —  z^))  <  D(ce),  so  the  above  line  is  at  most 


r/  |  /  n(°  \\  i  i  I  Sm{Si)b{c£)  Sm(Si)  |  rr  ^  j.  ^ 

A  uc^m,  D(c£))  +  t/ - t - =UCi{m,£,Si). 


m 


m 


In  particular,  this  implies  that 


£ci(m,8i)  = 
< 

< 

< 


inf  |e  >  0  :Vje  Ze,  UCi(m,  V,  Si)  <  2^ 
inf  |e  >  (ui  -  z/oo)  :  Vj  G  Z£,  UCi(m,  23 ,  Si)  <  2J“4 
inf  |e  >  -  z/qo)  :  Vj  G  Z£,  ^(m,  2J,  <  2J_4 

max  | inf  |e  >  0  :  Vj  G  Z£,  UCi(m,  2\  Si)  <  2J~4  [  ,  (ut  -  v0 
max  <  hc.(m,Si),Vi  -  va 


□ 


Proof of Theorem 2.15. By the same argument that led to (2.10), we have that


tCi{m,Si)  <  K2 


di  logf 


m 
for some constant $K_2$ (depending on $\mu$).


Now assume the event $\bigcap_{i=1}^{\infty} H_{C_i,\delta_i} \cap E_{C_i,\delta_i}$ occurs. In particular,
Lemma 2.21 implies that $\forall n \in \mathbb{N}$,


er(hn)  -  v*  <  min  1,  3  min  (  2(z/*  -  v^)  +  £c i(min,  $i)) 

iGN  \  / 


di  log  ■ 


<  Ji3  min  [Vi  —  u*)  +  min  <  1, 

ieN  1  I  min 

for some constant $K_3$.

Now take $i \in \mathbb{N}$. The label request bound of Lemma 2.18, along with Lemma 2.22, implies
that


Stt?  ^  A 

[n/ (2i2)J  <  log  — +  KSi  (  2  + 


rmin—l 


max  —  v 


di  log 


X 


dx 


<  K59i  max  (17  -  v*)min ,  di  log 2(min)  log  ■ 


Let  7j  (n)  = 


i26idi  log  | 

di  log^fi 


.  Then 


rrii. 


<  Kq  (  (Ui  -  u*) 1  +  di  log  j  (1  +  7 i(n))  exp  {-c2^i(n)} 

7 An)  b 


Thus, 


di  log : 


min  (  1,  ‘  ^  d  <  min  1,  K7  I  (17  -  v*)  +  dt  log  -  (1  +  7;(f))  exp  {-c27*(n)} 


The  result  follows  from  this  by  some  simple  algebra. 


□ 


2.8  Time  Complexity  of  Algorithm  2 

It is worth making a few remarks about the time complexity of Algorithm 2 when used with
the (2.6) threshold. Clearly the $\mathrm{LEARN}_C$ subroutine could be at least as computationally hard
as empirical risk minimization (ERM) over $C$. For most interesting hypothesis classes, this
is known to be NP-hard, though interestingly, there are some efficient special cases [e.g.,
Kalai, Klivans, Mansour, and Servedio, 2005]. Additionally, there is the matter of calculating
$\hat{E}_m(\delta; C, \mathcal{L})$. The challenge here is due to the localization $C(\epsilon; \mathcal{L})$ in the
empirical Rademacher process calculation and the empirical diameter calculation.


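However, using a trick similar to that in [Bartlett, Bousquet, and Mendelson, 2005], we can
calculate or bound these quantities via an efficient reduction to minimization of a weighted
empirical error. That is, the only possibly difficult step in calculating
$\hat{\phi}_m(\epsilon; C, \mathcal{L})$ requires only that we identify
$h_1 = \operatorname{argmin}_{h \in C_m(\epsilon; \mathcal{L})} \mathrm{er}_m(h, \xi)$ and
$h_2 = \operatorname{argmin}_{h \in C_m(\epsilon; \mathcal{L})} \mathrm{er}_m(h, -\xi)$, where
$\mathrm{er}_m(h, \xi) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}[h(X_i) \neq \xi_i]$ and $\mathrm{er}_m(h, -\xi)$ is the
same but with $-\xi_i$. Similarly, letting $h_{\mathcal{L}} = \mathrm{LEARN}_C(\mathcal{L}, Q)$ for
$\mathcal{L} \cup Q$ generated from the first $m$ unlabeled examples, we can bound
$\hat{D}_m(\epsilon; C, \mathcal{L})$ within a factor of 2 by $2\,\mathrm{er}_m(h', h_{\mathcal{L}})$, where
$h' = \operatorname{argmin}_{h \in C_m(\epsilon; \mathcal{L})} \mathrm{er}_m(h, -h_{\mathcal{L}})$ and
$\mathrm{er}_m(f, g) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}[f(X_i) \neq g(X_i)]$. All that remains is to specify
how this optimization for $h_1$, $h_2$, and $h'$ can be performed. Taking the $h_1$ case for example,
we can solve the optimization as follows. We find

\[ h^{(\lambda)} = \operatorname{argmin}_{h \in C} \sum_{i=1}^{m} \mathbb{1}[h(X_i) \neq \xi_i] + \sum_{(x,y) \in Q} \lambda\, \mathbb{1}[h(x) \neq y] + \sum_{(x,y) \in \mathcal{L}} 2 \max\{1, \lambda\}\, m\, \mathbb{1}[h(x) \neq y], \]

where $\lambda$ is a Lagrange multiplier; we can calculate $h^{(\lambda)}$ for $O(m^2)$ values of $\lambda$ in a
discrete grid, and from these choose the one with smallest $\mathrm{er}_m(h^{(\lambda)}, \xi)$ among those with
$\mathrm{er}_{\mathcal{L} \cup Q}(h^{(\lambda)}) - \mathrm{er}_{\mathcal{L} \cup Q}(h_{\mathcal{L}}) \leq \epsilon$. The third term guarantees the
solution satisfies $\mathrm{er}_{\mathcal{L}}(h^{(\lambda)}) = 0$, while the value of $\lambda$ specifies the trade-off
between $\mathrm{er}_{\mathcal{L} \cup Q}(h^{(\lambda)})$ and $\mathrm{er}_m(h^{(\lambda)}, \xi)$. The calculation for $h_2$ and $h'$
is analogous. Additionally, we can clearly formulate the $\mathrm{LEARN}$ subroutine as such a weighted
ERM problem as well.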

For each of these weighted ERM problems, a further polynomial reduction to (unweighted)
empirical risk minimization is possible. In particular, we can replicate the examples a number
of times proportional to the weights, generating an ERM problem on $O(m^2)$ examples. Thus,
for processing any finite number of unlabeled examples $m$, the time complexity of Algorithm
2 (substituting the above 2-approximation for $\hat{D}_m(\epsilon; C, \mathcal{L})$, which only changes constant
factors in the results of Section 2.3.4) should be no more than a polynomial factor worse than the
time complexity of empirical risk minimization with $C$, for the worst case over all samples of size
$O(m^2)$.
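As an illustration of the replication step, the following is a minimal Python sketch, not the procedure used in the analysis. It assumes a small finite hypothesis class searched by brute force and rational weights; the names `threshold`, `replicate`, `weighted_erm`, and `erm` are hypothetical helpers introduced only for this example.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def threshold(t):
    """Hypothetical classifier: +1 at or above t, -1 below."""
    return lambda x: +1 if x >= t else -1


def replicate(examples, weights):
    """Copy each (x, y) a number of times proportional to its weight."""
    fracs = [Fraction(w) for w in weights]
    # The least common multiple of the denominators makes every weight an integer.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (f.denominator for f in fracs), 1)
    expanded = []
    for example, f in zip(examples, fracs):
        expanded.extend([example] * int(f * scale))
    return expanded


def weighted_erm(hypotheses, examples, weights):
    """Brute-force minimizer of the weighted number of mistakes."""
    def loss(h):
        return sum(w for (x, y), w in zip(examples, weights) if h(x) != y)
    return min(hypotheses, key=loss)


def erm(hypotheses, examples):
    """Brute-force minimizer of the plain number of mistakes."""
    def loss(h):
        return sum(1 for x, y in examples if h(x) != y)
    return min(hypotheses, key=loss)
```

Because replication multiplies every hypothesis's loss by the same positive factor, the unweighted minimizer on the expanded sample coincides with the weighted minimizer on the original sample (ties included, since both searches scan the class in the same order).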


2.9  A  Refined  Analysis  of  PAC  Learning  Via  the  Disagreement  Coefficient


Throughout this section, we will work in $\mathrm{Realizable}(C)$ and denote $\mathcal{D} = \mathcal{D}_{XY}[\mathcal{X}]$. In
particular, there is always a target function $f \in C$ with $\mathrm{er}(f) = 0$.

Note that the known general upper bound for this problem is that, if the VC dimension of $C$
is $d$, then with probability $1 - \delta$, every classifier in $C$ consistent with $n$ random samples has
error rate at most

\[ \frac{4}{n} \left( d \ln\frac{2en}{d} + \ln\frac{4}{\delta} \right). \tag{2.13} \]

This is due to Vapnik [1982]. There is a slightly different bound (for a different learning strategy)
of

\[ \propto \frac{d \log(1/\delta)}{n} \tag{2.14} \]

proven by Haussler, Littlestone, and Warmuth [1994]. It is also known that one cannot get a
distribution-free bound smaller than

\[ \propto \frac{d + \log(1/\delta)}{n} \]

for any concept space [Vapnik, 1982]. The question we are concerned with here is deriving upper
bounds that are closer to this lower bound than either (2.13) or (2.14) in some cases.

For our purposes, throughout this section we will take $r_0 = \frac{d + \log(1/\delta)}{n}$ in the definition
of the disagreement coefficient. In particular, recall that $\theta_f \leq \frac{1}{r_0}$ always, and this will imply
a fallback guarantee no worse than those above for our analysis below. However, it is sometimes
much smaller, or even constant, in which case our analysis here may be better than those mentioned
above.
2.9.1  Error  Rates  for  Any  Consistent  Classifier


For simplicity and to focus on the nontrivial cases, the results in this section will be stated for
the case where $\mathbb{P}(\mathrm{DIS}(C)) > 0$. The $\mathbb{P}(\mathrm{DIS}(C)) = 0$ case is trivial, since every $h \in C$ has
$\mathrm{er}(h) = 0$ there.

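Theorem 2.23. Let $d$ be the VC dimension of concept space $C$, and let
$V_n = \{h \in C : \forall i \leq n,\ h(x_i) = f(x_i)\}$, where $f \in C$ is the target function (i.e., $\mathrm{er}(f) = 0$),
and $(x_1, x_2, \ldots, x_n) \sim \mathcal{D}^n$ is a sequence of i.i.d. training examples. Then for any $\delta \in (0, 1)$,
with probability $\geq 1 - \delta$, $\forall h \in V_n$,

\[ \mathrm{er}(h) \leq \frac{24}{n} \left( d \ln(880\, \theta_f) + \ln\frac{12}{\delta} \right). \]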


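Proof. Since $\mathbb{P}(\mathrm{DIS}(C)) > 0$ by assumption, $\theta_f > 0$ (and $d > 0$ also follows). As above, let
$V_m = \{h \in C : \forall i \leq m,\ h(x_i) = f(x_i)\}$, and define $\mathrm{radius}(V_m) = \sup_{h \in V_m} \mathrm{er}(h)$. We will
prove the result by induction on $n$. As a base case, note that the result clearly holds for $n \leq d$, as
we always have $\mathrm{er}(h) \leq 1$.

Now suppose $n \geq d + 1 \geq 2$, and suppose the result holds for any $m < n$; in particular,
consider $m = \lfloor n/2 \rfloor$. Thus, for any $\delta \in (0, 1)$, with probability $\geq 1 - \delta/3$,

\[ \mathrm{radius}(V_m) \leq \frac{24}{m} \left( d \ln(880\, \theta_f) + \ln\frac{36}{\delta} \right). \tag{2.15} \]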


Note  that  rn  <  rm,  so  we  can  take  this  inequality  to  hold  for  the  Of  defined  with  rn  as  well.  If 
P (DIS(Vm))  <  In  |  In  |,  then  <12. 151)  is  valid  (as  is  (12.161)  below)  since  radius(Vn )  < 
$\mathrm{radius}(V_m) \leq \mathbb{P}(\mathrm{DIS}(V_m))$. Otherwise, by a Chernoff bound, with probability $\geq 1 - \delta/3$, we
have


\[ |\{x_{m+1}, x_{m+2}, \ldots, x_n\} \cap \mathrm{DIS}(V_m)| \geq \mathbb{P}(\mathrm{DIS}(V_m)) \lceil n/2 \rceil / 2 =: N. \]
(2.13) tells us that given this event, with probability $\geq 1 - \delta/3$,


radius(Vn )  =  ¥{DIS{Vm))radius(Vn\DIS(Vm )) 

u4  /  2eN  ,  12\  16  /  2eP(DJS,(Vrm))ra  ,  12 

<  p( w-» ^  —  +  >" T)  ^  -  (-*■-  4/  +  >” T 

16  /  ,,  e9 f radius (Vm)n  ,  12 

r111 — m — +  ”7 


Applying the inductive hypothesis for $\mathrm{radius}(V_m)$ combined with a union bound over these 3
failure events (each of probability $\delta/3$), we have that with probability $\geq 1 - \delta$,


16  f  f  f  36 

radius(Vn)  <  —  id  In  (4866*/  (in  (8800/)  +  -  In  — 


,  12 
111 T 


(2.16) 


If  d  >  -  In  ■¥,  then  the  right  side  of  (12.161)  is  at  most 


^  (d  In  (0/  48e  In  (880  ■  3  •  ee9f ))  +  In  ^ 


<  ^  (din  (9f 48eln  (400080/))  +  In  y 


<  —  Id  In  (26O990f2)  +  In  ^  )  <  —  Id  In  (8800/)  +  In  ^ 


n  V 


12 


24 

n 


12 


Otherwise  d  <  -  In  //,  so  that  the  right  side  of  (12.161)  is  at  most 


—  f  din  f  0/48e  In  (880  •  30/)  1  In  ^  j  +  In  ^ 


<  —  ( d In  ( 6705 93/2)  +  din  ( 1  In  ^  +  In 


n 


d 


5 


24/  2/1  \  12\  24  /  12 

<  —  (  din  (3569 f)  +  -  -  +  1  In  —  <  —  (  din  (880 0f)  +  In  — 

n  \  3  \e  o  n  \  5 


The  theorem  now  follows  by  the  principle  of  induction. 


□ 


With this result in hand, we can immediately get some interesting results, such as the following
corollary.
Corollary 2.24. Suppose $C$ is the space of linear separators in $d$ dimensions that pass through
the origin, and suppose the distribution is uniform on the surface of the origin-centered unit
sphere. Then with probability $\geq 1 - \delta$, any $h \in C$ consistent with the $n$ i.i.d. training examples
has (for some finite universal $c$)

\[ \mathrm{er}(h) \leq c\, \frac{d \log d + \log\frac{1}{\delta}}{n}. \]

Proof. [Hanneke, 2007b] proves that $\sup_{f \in C} \theta_f \leq \pi\sqrt{d}$ for this problem.

□


This improves over the best previously known bound for consistent classifiers for this problem
in its dependence on $n$, which was $\propto \frac{d\sqrt{\log(n/d)} + \log(1/\delta)}{n}$ [Li and Long, 2007] (though we
picked up an extra $\log d$ factor in the process).
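To see what "dependence on $n$" means numerically, here is a small Python sketch comparing the shapes of the two bounds. It assumes the forms $\frac{4}{n}(d \ln\frac{2en}{d} + \ln\frac{4}{\delta})$ for (2.13) and $\frac{24}{n}(d \ln(880\,\theta_f) + \ln\frac{12}{\delta})$ for Theorem 2.23, and takes $\theta_f = \pi\sqrt{d}$ as in the proof of Corollary 2.24; the function names are hypothetical.

```python
import math


def vapnik_bound(n, d, delta):
    """Assumed form of (2.13): (4/n) (d ln(2en/d) + ln(4/delta))."""
    return (4.0 / n) * (d * math.log(2 * math.e * n / d) + math.log(4 / delta))


def disagreement_bound(n, d, delta, theta):
    """Assumed form of Theorem 2.23: (24/n) (d ln(880 theta) + ln(12/delta))."""
    return (24.0 / n) * (d * math.log(880 * theta) + math.log(12 / delta))


def scaled_bounds(n, d, delta):
    """Return n times each bound, which isolates the dependence on n."""
    theta = math.pi * math.sqrt(d)  # linear separators, uniform on the sphere
    return (n * vapnik_bound(n, d, delta),
            n * disagreement_bound(n, d, delta, theta))
```

Multiplying by $n$ shows the difference in rates: the second quantity does not change with $n$, while the first grows like $d \ln n$. With the constants assumed here the second bound is numerically smaller only for very large $n$, so the comparison concerns rates rather than small-sample values.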


2.9.2  Specializing  to  Particular  Algorithms 

The above analysis is for arbitrary algorithms that select a classifier consistent with the training
data. However, we can modify the disagreement coefficient to be more interesting for more
specific algorithms. Specifically, suppose there are sets $C_f$ such that with high probability algorithm
$\mathcal{A}$ will output a classifier in $C_f$ when $f$ is the target function. Then we only need to worry about
the regions of disagreement within these $C_f$ sets, which may be significantly smaller than within
the full space $C$.

To give a concrete example, consider the Closure algorithm: output the $h \in C$ with smallest
$\mathbb{P}(h(X) = +1)$ that is consistent with the data. For intersection-closed $C$, the sets are
$C_f = \{h \in C : h(x) = +1 \Rightarrow f(x) = +1\}$. So effectively, this becomes our concept space, and the
disagreement coefficient of $f$ with respect to $C_f$ and $\mathcal{D}$ can be significantly smaller than it is with
respect to the full space $C$. For instance, if $C$ is axis-aligned rectangles, then the disagreement
coefficient of any $f \in C$ with respect to $C_f$ and $\mathcal{D}$ is at most $d$. This implies a bound

\[ \propto \frac{d \log d + \log(1/\delta)}{n}. \]
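For axis-aligned rectangles, the Closure rule has a simple concrete form: in the realizable case every consistent rectangle must contain all positive examples, so the bounding box of the positives is contained in every consistent rectangle and therefore has the smallest probability of predicting $+1$. A minimal Python sketch, with hypothetical helper names `closure_rectangle` and `predict`:

```python
def closure_rectangle(samples):
    """Smallest axis-aligned rectangle containing every positive example.

    samples: list of (point, label) pairs, point a tuple of floats, label +1 or -1.
    Returns (lows, highs), or None for the empty rectangle (no positives seen).
    """
    positives = [x for x, y in samples if y == +1]
    if not positives:
        return None
    dim = len(positives[0])
    lows = tuple(min(p[k] for p in positives) for k in range(dim))
    highs = tuple(max(p[k] for p in positives) for k in range(dim))
    return lows, highs


def predict(rect, x):
    """Label +1 inside the rectangle (boundary included), -1 outside."""
    if rect is None:
        return -1
    lows, highs = rect
    inside = all(lo <= v <= hi for lo, v, hi in zip(lows, x, highs))
    return +1 if inside else -1
```

When the labels come from a target rectangle, the returned box always lies inside that target, which is the containment property $h(x) = +1 \Rightarrow f(x) = +1$ used to define $C_f$.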
We already have better bounds than this for using Closure with this concept space. However,
if the $d$ upper bound on disagreement coefficient with respect to $C_f$ is true for general
intersection-closed spaces $C$, this would match the best known bounds for general
intersection-closed spaces [Auer and Ortner, 2004].
Chapter  3


Significance  of  the  Verifiable/Unverifiable 
Distinction  in  Realizable  Active  Learning 


This chapter describes and explores a new perspective on the label complexity of active learning
in the fixed-distribution realizable case. In many situations where it was generally thought that
active learning does not help, we show that active learning does help in the limit, often with
exponential improvements in label complexity. This contrasts with the traditional analysis of
active learning problems such as non-homogeneous linear separators or depth-limited decision
trees, in which $\Omega(1/\epsilon)$ lower bounds are common. Such lower bounds should be interpreted
carefully; indeed, we prove that it is always possible to learn an $\epsilon$-good classifier with a number
of labels asymptotically smaller than this. These new insights arise from a subtle variation on
the traditional definition of label complexity, not previously recognized in the active learning
literature.


Remark 3.1. The results in this chapter are taken from [Balcan, Hanneke, and Wortman, 2008],
joint work with Maria-Florina Balcan and Jennifer Wortman.
3.1  Introduction


A number of active learning analyses have recently been proposed in a PAC-style setting, both for
the realizable and for the agnostic cases, resulting in a sequence of important positive and negative
results [Balcan et al., 2006, 2007, Cohn et al., 1994, Dasgupta, 2004, 2005, Dasgupta et al., 2005,
2007, Hanneke, 2007a,b]. In particular, the most concrete noteworthy positive result for
when active learning helps is that of learning homogeneous (i.e., through the origin) linear
separators, when the data is linearly separable and distributed uniformly over the unit sphere,
and this example has been extensively analyzed [Balcan et al., 2006, 2007, Dasgupta, 2005,
Dasgupta et al., 2005, 2007]. However, few other positive results are known, and there are simple
(almost trivial) examples, such as learning intervals or non-homogeneous linear separators
under the uniform distribution, where previous analyses of label complexities have indicated that
perhaps active learning does not help at all [Dasgupta, 2005].


In this work, we approach the analysis of active learning algorithms from a different angle.
Specifically, we point out that traditional analyses have studied the number of label requests
required before an algorithm can both produce an $\epsilon$-good classifier and prove that the classifier's
error is no more than $\epsilon$. These studies have turned up simple examples where this number is
no smaller than the number of random labeled examples required for passive learning. This is
the case for learning certain nonhomogeneous linear separators and intervals on the real line,
and generally seems to be a common problem for many learning scenarios. As such, it has led
some to conclude that active learning does not help for most learning problems. One of the goals
of our present analysis is to dispel this misconception. Specifically, we study the number of
labels an algorithm needs to request before it can produce an $\epsilon$-good classifier, even if there is
no accessible confidence bound available to verify the quality of the classifier. With this type
of analysis, we prove that active learning can essentially always achieve asymptotically superior
label complexity compared to passive learning when the VC dimension is finite. Furthermore,
we find that for most natural learning problems, including the negative examples given in the
previous literature, active learning can achieve exponential improvements over passive learning
with respect to dependence on $\epsilon$. This situation is characterized in Figure 3.1.

Figure 3.1: Active learning can often achieve exponential improvements, though in many cases
the amount of improvement cannot be detected from information available to the learning
algorithm. Here $\gamma$ may be a target-dependent constant.

To our knowledge, this is the first work to address this subtle point in the context of active
learning. Though several previous papers have studied bounds on this latter type of label
complexity [Castro and Nowak, 2007, Dasgupta et al., 2005, 2007], their results were no stronger
than the results one could prove in the traditional analysis. As such, it seems this large gap
between the two types of label complexities has gone unnoticed until now.


3.1.1  A  Simple  Example:  Intervals 

To get some intuition about when these types of label complexity are different, consider the
following example. Suppose that $C$ is the class of all intervals over $[0, 1]$ and $\mathcal{D}$ is a uniform
distribution over $[0, 1]$. If the target function is the empty interval, then for any sufficiently small
$\epsilon$, in order to verify with high confidence that this (or any) interval has error $\leq \epsilon$, we need to
request labels in at least a constant fraction of the $\Omega(1/\epsilon)$ intervals $[0, 2\epsilon], [2\epsilon, 4\epsilon], \ldots$, requiring
$\Omega(1/\epsilon)$ total label requests.

1 We slightly abuse the term “exponential” throughout the chapter. In particular, we refer to any $\mathrm{polylog}(1/\epsilon)$ as
being an exponential improvement over $1/\epsilon$.
However, no matter what the target function is, we can find an $\epsilon$-good classifier with only
a logarithmic label complexity via the following extremely simple 2-phase learning algorithm.
The algorithm will be allowed to make $t$ label requests, and then we will find a value of $t$ that is
sufficiently large to guarantee learning. We start with a large ($\Omega(2^t)$) set of unlabeled examples.
In the first phase, on each round we choose a point $x$ uniformly at random from the unlabeled
sample and query its label. We repeat this until we either observe a $+1$ label, at which point we
enter the second phase, or we use all $t$ label requests. In the second phase, we alternate between
running one binary search on the examples between 0 and that $x$ and a second on the examples
between that $x$ and 1 to approximate the end-points of the interval. Once we use all $t$ label
requests, we output a smallest interval consistent with the observed labels.
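The two phases can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the pool is a finite list of points, `oracle` is a hypothetical label function returning $+1$ or $-1$, the random seed is fixed for reproducibility, and the function stops early once both binary searches have converged.

```python
import random


def learn_interval(pool, oracle, t, seed=0):
    """Two-phase active learner for intervals with a budget of t label requests.

    Returns (left, right), the smallest interval consistent with the labels
    seen, or None (the empty interval) if no positive example was found.
    """
    rng = random.Random(seed)
    pts = sorted(pool)
    if not pts:
        return None
    used, pos = 0, None
    # Phase 1: query uniformly random pool points until a +1 label appears.
    while used < t and pos is None:
        i = rng.randrange(len(pts))
        used += 1
        if oracle(pts[i]) == +1:
            pos = i
    if pos is None:
        return None
    # Phase 2: alternate two binary searches, one for each end-point.
    # Invariant: pts[pos_l..pos_r] are positive; indices <= neg_l and
    # >= neg_r are negative (-1 and len(pts) act as virtual negatives).
    neg_l, pos_l = -1, pos
    pos_r, neg_r = pos, len(pts)
    turn_left = True
    while used < t and (pos_l - neg_l > 1 or neg_r - pos_r > 1):
        left_open = pos_l - neg_l > 1
        right_open = neg_r - pos_r > 1
        if (turn_left and left_open) or not right_open:
            mid = (neg_l + pos_l) // 2
            used += 1
            if oracle(pts[mid]) == +1:
                pos_l = mid
            else:
                neg_l = mid
        else:
            mid = (pos_r + neg_r) // 2
            used += 1
            if oracle(pts[mid]) == +1:
                pos_r = mid
            else:
                neg_r = mid
        turn_left = not turn_left
    return pts[pos_l], pts[pos_r]
```

Note that the learner returns a hypothesis after any budget $t$, including $t = 0$, without producing any certificate of its error; this is the unverifiable setting discussed in this chapter.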


If the target $h^*$ labels every point as $-1$ (the so-called all-negative function), the algorithm
described above would output a hypothesis with 0 error even after 0 label requests, so any $t \geq 0$
suffices in this case. On the other hand, if the target is an interval $[a, b] \subseteq [0, 1]$, where
$b - a = w > 0$, then after roughly $O(1/w)$ queries (a constant number that depends only on the target), a
positive example will be found. Since only $O(\log(1/\epsilon))$ additional queries are required to run the
binary search to reach error rate $\epsilon$, it suffices to have $t \geq O(1/w + \log(1/\epsilon)) = O(\log(1/\epsilon))$. So in
general, the label complexity is at worst $O(\log(1/\epsilon))$. Thus, we see a sharp distinction between
the label complexity required to find a good classifier (logarithmic) and the label complexity
needed to both find a good classifier and verify that it is good.
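The "roughly $O(1/w)$" count for the first phase can be made explicit with a short calculation. This is a sketch that treats each first-phase query as an independent uniform draw from $[0, 1]$ (ignoring the finite pool) and introduces a failure probability $\delta$ that is an assumption of this illustration:

```latex
\Pr[\text{no positive label in } k \text{ queries}]
  = (1 - w)^k \;\le\; e^{-wk} \;\le\; \delta
  \quad\text{whenever}\quad k \;\ge\; \frac{1}{w}\ln\frac{1}{\delta}.
```

For a fixed target the width $w$ is a constant, so this count does not grow as $\epsilon \to 0$; only the binary searches contribute the $\log(1/\epsilon)$ term.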


This  example  is  particularly  simple,  since  there  is  effectively  only  one  “hard”  target  function 
(the  all-negative  target).  However,  most  of  the  spaces  we  study  are  significantly  more  complex 
than  this,  and  there  are  generally  many  targets  for  which  it  is  difficult  to  achieve  good  verifiable 
complexity. 
3.1.2  Our  Results


We  show  that  in  many  situations  where  it  was  previously  believed  that  active  learning  cannot 
help,  active  learning  does  help  in  the  limit.  Our  main  specific  contributions  are  as  follows: 

•  We distinguish between two different variations on the definition of label complexity. The
traditional definition, which we refer to as verifiable label complexity, focuses on the number
of label requests needed to obtain a confidence bound indicating an algorithm has
achieved at most $\epsilon$ error. The newer definition, which we refer to simply as label complexity,
focuses on the number of label requests before an algorithm actually achieves at most
$\epsilon$ error. We point out that the latter is often significantly smaller than the former, in contrast
to passive learning where they are often equivalent up to constants for most nontrivial
learning problems.

•  We prove that any distribution and finite VC dimension concept class has active learning
label complexity asymptotically smaller than the label complexity of passive learning for
nontrivial targets. A simple corollary of this is that finite VC dimension implies $o(1/\epsilon)$
active learning label complexity.

•  We show it is possible to actively learn with an exponential rate a variety of concept classes
and distributions, many of which are known to require a linear rate in the traditional analysis
of active learning: for example, intervals on $[0, 1]$ and non-homogeneous linear separators
under the uniform distribution.

•  We show that even in this new perspective, there do exist lower bounds; it is possible to
exhibit somewhat contrived distributions where exponential rates are not achievable even
for some simple concept spaces (see Theorem 3.11). The learning problems for which
these lower bounds hold are much more intricate than the lower bounds from the traditional
analysis, and intuitively seem to represent the core of what makes a hard active learning
problem.
3.2  Background  and  Notation


In various places throughout this chapter, we will need notation for a countable dense subset of
a hypothesis class $V$. For any set of classifiers $V$, we will denote by $\tilde{V}$ a countable (or possibly
finite) subset of $V$ s.t. $\forall \alpha > 0$, $\forall h \in V$, $\exists h' \in \tilde{V}$ with $\mathbb{P}_{\mathcal{D}_{XY}[\mathcal{X}]}(h(X) \neq h'(X)) \leq \alpha$. Such
a set is guaranteed to exist under mild conditions; in particular, finite VC dimension suffices to
guarantee its existence. We introduce this notion to avoid certain degenerate behaviors, such as
when $\mathrm{DIS}(B(h, 0)) = \mathcal{X}$. For instance, the hypothesis class of classifiers on the $[0, 1]$ interval
that label exactly one point positive has this property under any nonatomic density function.

Since  all  of  the  results  in  this  chapter  are  for  the  fixed-distribution  realizable  case,  it  will  be 
convenient  to  introduce  the  following  short-hand  notation. 

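Definition 3.1. A function $\Lambda(\epsilon, \delta, h^*)$ is a label complexity for a pair $(C, \mathcal{D})$ if there exists an
active learning algorithm $\mathcal{A}$ achieving label complexity $\Lambda(\epsilon, \delta, \mathcal{D}_{XY}) = \Lambda(\epsilon, \delta, h^*_{\mathcal{D}_{XY}})$ for all
$\mathcal{D}_{XY} \in \mathrm{Realizable}(C, \mathcal{D})$, where $\mathcal{D}$ is a distribution over $\mathcal{X}$ and $h^*_{\mathcal{D}_{XY}}$ is the target function
under $\mathcal{D}_{XY}$.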

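Definition 3.2. A function $\Lambda(\epsilon, \delta, h^*)$ is a verifiable label complexity for a pair $(C, \mathcal{D})$ if there
exists an active learning algorithm $\mathcal{A}$ achieving verifiable label complexity
$\Lambda(\epsilon, \delta, \mathcal{D}_{XY}) = \Lambda(\epsilon, \delta, h^*_{\mathcal{D}_{XY}})$ for all $\mathcal{D}_{XY} \in \mathrm{Realizable}(C, \mathcal{D})$, where $\mathcal{D}$ is a distribution over
$\mathcal{X}$ and $h^*_{\mathcal{D}_{XY}}$ is the target function under $\mathcal{D}_{XY}$.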

Let us take a moment to reflect on the difference between the two definitions of label
complexity: namely, verifiable and unverifiable. The distinction may appear quite subtle. Both
definitions allow the label complexity to depend both on the target function and on the input
distribution. The only distinction is whether or not there is an accessible guarantee or confidence
bound on the error of the chosen hypothesis that is also at most $\epsilon$. This confidence bound can
only depend on quantities accessible to the learning algorithm, such as the $t$ requested labels. As
an illustration of this distinction, consider again the problem of learning intervals. As described
above, if the target $h^*$ is an interval of width $w$, then after seeing $O(1/w + \log(1/\epsilon))$ labels, with
high probability it is possible for an algorithm to guarantee that it can output a function with
error less than $\epsilon$. In this case, for sufficiently small $\epsilon$, the verifiable label complexity $\Lambda(\epsilon, \delta, h^*)$
is proportional to $\log(1/\epsilon)$. However, if $h^*$ is the all-negative function, then the verifiable label
complexity is at least proportional to $1/\epsilon$ for all values of $\epsilon$ because a high-confidence guarantee
can never be made without observing $\Omega(1/\epsilon)$ labels; for completeness, a formal proof of this fact
is included in Section 3.7. In contrast, as we have seen, the label complexity is $O(\log(1/\epsilon))$ for
any target in the class of intervals when no such guarantee is required.

A common alternative formulation of verifiable label complexity is to let $\mathcal{A}$ take $\epsilon$ as an
argument and allow it to choose online how many label requests it needs in order to guarantee
error at most $\epsilon$ [Dasgupta, 2005]. This alternative definition is almost equivalent (an algorithm
for either definition can be modified to fit the other definition without significant loss in the
verifiable label complexity values), as the algorithm must be able to produce a confidence bound
of size at most $\epsilon$ on the error of its hypothesis in order to decide when to stop requesting labels
anyway.2


3.2.1  The  Verifiable  Label  Complexity 


To date, there has been a significant amount of work studying the verifiable label complexity
(though typically under the aforementioned alternative formulation). It is clear from standard
results in passive learning that verifiable label complexities of $O((d/\epsilon) \log(1/\epsilon) + (1/\epsilon) \log(1/\delta))$
are easy to obtain for any learning problem, by requesting the labels of random examples. As
such, there has been much interest in determining when it is possible to achieve verifiable label
complexity smaller than this, and in particular, when the verifiable label complexity is a
polylogarithmic function of $1/\epsilon$ (representing exponential improvements over passive learning).

2 There is some question as to what the “right” formal model of active learning is in general. For instance, we
could instead let $\mathcal{A}$ generate an infinite sequence of $h_t$ hypotheses (or $(h_t, \epsilon_t)$ in the verifiable case), where $h_t$
can depend only on the first $t$ label requests made by the algorithm along with some initial segment of unlabeled
examples (as in [Castro and Nowak, 2007]), representing the case where we are not sure a priori of when we will
stop the algorithm. However, for our present purposes, such alternative models are equivalent in label complexity
up to constants.

As discussed in previous chapters, there have been a few quantities proposed to measure the verifiable label complexity of active learning on any given concept class and distribution. Dasgupta’s splitting index [Dasgupta, 2005], which is dependent on the concept class, data distribution, target function, and a parameter $\tau$, quantifies how easy it is to make progress toward reducing the diameter of the version space by choosing an example to query. Another quantity to which we will frequently refer is the disagreement coefficient [Hanneke, 2007b], defined in Chapter 3.

The disagreement coefficient is often a useful quantity for analyzing the verifiable label complexity of active learning algorithms. For example, as we saw in Chapter 2, Algorithm 0 achieves a verifiable label complexity at most $\theta_{h^*} d \cdot \mathrm{polylog}(1/(\epsilon\delta))$ when run with hypothesis class $C$ for target function $h^* \in C$. We will use it in several of the results below. In all of the relevant results of this chapter, we will simply take $r_0 = 0$ in the definition of the disagreement coefficient.

We  will  see  that  both  the  disagreement  coefficient  and  splitting  index  are  also  useful  quantities 
for  analyzing  unverifiable  label  complexities,  though  their  use  in  that  case  is  less  direct. 


3.2.2  The  True  Label  Complexity 

This chapter focuses on situations where true label complexities are significantly smaller than verifiable label complexities. In particular, we show that many common pairs $(C, \mathcal{D})$ have label complexity that is polylogarithmic in both $1/\epsilon$ and $1/\delta$ and linear only in some finite target-dependent constant $\gamma_{h^*}$. This contrasts sharply with the infamous $1/\epsilon$ lower bounds mentioned above, which have been identified for verifiable label complexity [Dasgupta, 2004, 2005, Freund et al., 1997, Hanneke, 2007a]. The implication is that, for any fixed target $h^*$, such lower bounds vanish as $\epsilon$ approaches 0. This also contrasts with passive learning, where $1/\epsilon$ lower bounds are typically unavoidable [Antos and Lugosi, 1998].
Definition 3.3. We say that $(C, \mathcal{D})$ is actively learnable at an exponential rate if there exists an active learning algorithm achieving label complexity

$$\Lambda(\epsilon, \delta, h^*) = \gamma_{h^*} \cdot \mathrm{polylog}(1/(\epsilon\delta))$$

for all $h^* \in C$, where $\gamma_{h^*}$ is a finite constant that may depend on $h^*$ and $\mathcal{D}$ but is independent of $\epsilon$ and $\delta$.

3.3  Strict  Improvements  of  Active  Over  Passive 


In  this  section,  we  describe  conditions  under  which  active  learning  can  achieve  a  label  complexity 
asymptotically  superior  to  passive  learning.  The  results  are  surprisingly  general,  indicating  that 
whenever  the  VC  dimension  is  finite,  essentially  any  passive  learning  algorithm  is  asymptotically 
dominated  by  an  active  learning  algorithm  on  all  targets. 

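Definition 3.4. A function $\Lambda(\epsilon, \delta, h^*)$ is a passive learning label complexity for a pair $(C, \mathcal{D})$ if there exists an algorithm $\mathcal{A}(((x_1, h^*(x_1)), (x_2, h^*(x_2)), \ldots, (x_t, h^*(x_t))), \delta)$ that outputs a classifier $h_{t,\delta}$, such that for any target function $h^* \in C$, $\epsilon \in (0, 1/2)$, $\delta \in (0, 1)$, for any $t \geq \Lambda(\epsilon, \delta, h^*)$,

$$\mathbb{P}_{\mathcal{D}}(er(h_{t,\delta}) \leq \epsilon) \geq 1 - \delta.$$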


Thus, a passive learning label complexity corresponds to a restriction of an active learning label complexity to algorithms that specifically request the first $t$ labels in the sequence and ignore the rest. In particular, it is known that for any finite VC dimension class, there is always an $O(1/\epsilon)$ passive learning label complexity [Haussler et al., 1994]. Furthermore, this is often (though not always) tight, in the sense that for any passive algorithm, there exist targets for which the corresponding passive learning label complexity is $\Omega(1/\epsilon)$ [Antos and Lugosi, 1998]. The following theorem states that for any passive learning label complexity, there exists an achievable active learning label complexity with a strictly slower asymptotic rate of growth. Its proof is included in Section 3.11.

Remark 3.2. This result is superseded by a stronger result in a later chapter; however, that later result is proven for a different algorithm, so that Theorem 3.5 is not entirely redundant. I have therefore chosen to include the result, since the construction of the algorithm may be of independent interest, even if the stated theorem is itself weaker than later results.


Theorem 3.5. Suppose $C$ has finite VC dimension, and let $\mathcal{D}$ be any distribution on $\mathcal{X}$. For any passive learning label complexity $\Lambda_p(\epsilon, \delta, h)$ for $(C, \mathcal{D})$, there exists an active learning algorithm achieving a label complexity $\Lambda_a(\epsilon, \delta, h)$ such that, for all $\delta \in (0, 1/4)$ and targets $h^* \in C$ for which $\Lambda_p(\epsilon, \delta, h^*) = \omega(1)$,

$$\Lambda_a(\epsilon, \delta, h^*) = o(\Lambda_p(\epsilon/4, \delta, h^*)).$$

In  particular,  this  implies  the  following  simple  corollary. 


Corollary 3.6. For any $C$ with finite VC dimension, and any distribution $\mathcal{D}$ over $\mathcal{X}$, there is an active learning algorithm that achieves a label complexity $\Lambda(\epsilon, \delta, h^*)$ such that for $\delta \in (0, 1/4)$,

$$\Lambda(\epsilon, \delta, h^*) = o(1/\epsilon)$$

for all targets $h^* \in C$.


Proof. Let $d$ be the VC dimension of $C$. The passive learning algorithm of Haussler, Littlestone & Warmuth [Haussler et al., 1994] is known to achieve a label complexity no more than $(kd/\epsilon) \log(1/\delta)$, for some universal constant $k < 200$. Applying Theorem 3.5 now implies the result. □


Note the interesting contrast, not only to passive learning, but also to the known results on the verifiable label complexity of active learning. This theorem definitively states that the $\Omega(1/\epsilon)$ lower bounds common in the literature on verifiable label complexity can never arise in the analysis of the true label complexity of finite VC dimension classes.
3.4  Decomposing  Hypothesis  Classes


Let us return once more to the simple example of learning the class of intervals over $[0, 1]$ under the uniform distribution. As discussed above, it is well known that the verifiable label complexity of the all-negative classifier in this class is $\Omega(1/\epsilon)$. However, consider the more limited class $C' \subset C$ containing only the intervals $h$ of width $w_h$ strictly greater than 0. Using the simple algorithm described in Section 3.1.1, this restricted class can be learned with a (verifiable) label complexity of only $O(1/w_h + \log(1/\epsilon))$. Furthermore, the remaining set of classifiers $C'' = C \setminus C'$ consists of only a single function (the all-negative classifier) and thus can be learned with verifiable label complexity 0. Here we have that $C$ can be decomposed into two subclasses $C'$ and $C''$, where both $(C', \mathcal{D})$ and $(C'', \mathcal{D})$ are learnable at an exponential rate. It is natural to wonder if the existence of such a decomposition is enough to imply that $C$ itself is learnable at an exponential rate.
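The two-phase idea behind a bound of the form $O(1/w_h + \log(1/\epsilon))$ can be sketched in Python. This is only an illustration under stated assumptions, not the thesis's algorithm verbatim: it assumes a pool-based setting with a finite pool of unlabeled points, a target that is an interval, and a hypothetical `oracle` callable that returns the label of a point. The names `learn_interval`, `pool`, and `oracle` are introduced here for illustration. Unlike a budgeted learner, this sketch labels the whole pool when no positive point exists.

```python
from typing import Callable, List, Tuple

Label = int  # +1 or -1


def learn_interval(
    pool: List[float], oracle: Callable[[float], Label]
) -> Tuple[Callable[[float], Label], int]:
    """Learn an interval target from a pool; return (classifier, labels used)."""
    used = 0

    def query(x: float) -> Label:
        nonlocal used
        used += 1
        return oracle(x)

    # Phase 1: label pool points in arrival order until a positive one appears.
    # For a target of width w under the uniform distribution this takes about
    # 1/w labels in expectation.
    seed = next((x for x in pool if query(x) == +1), None)
    if seed is None:
        # No positive point in the pool: fall back to the all-negative classifier.
        return (lambda x: -1), used

    pts = sorted(pool)
    s = pts.index(seed)

    # Phase 2a: binary search for the leftmost positive pool point in pts[0..s].
    lo, hi = 0, s
    while lo < hi:
        mid = (lo + hi) // 2
        if query(pts[mid]) == +1:
            hi = mid
        else:
            lo = mid + 1
    left = pts[lo]

    # Phase 2b: binary search for the rightmost positive pool point in pts[s..].
    lo, hi = s, len(pts) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if query(pts[mid]) == +1:
            lo = mid
        else:
            hi = mid - 1
    right = pts[lo]

    return (lambda x: +1 if left <= x <= right else -1), used
```

Phase 2 costs about $2 \log_2$ of the pool size, so with a pool of size proportional to $1/\epsilon$ the total is roughly $1/w_h + 2\log_2(1/\epsilon)$ labels for a positive-width target.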


More generally, suppose that we are given a distribution $\mathcal{D}$ and a hypothesis class $C$ such that we can construct a sequence of subclasses $C_i$ with label complexity $\Lambda_i(\epsilon, \delta, h)$, with $C = \bigcup_{i=1}^{\infty} C_i$. Thus, if we knew a priori that the target $h^*$ was a member of subclass $C_i$, it would be straightforward to achieve $\Lambda_i(\epsilon, \delta, h^*)$ label complexity. It turns out that it is possible to learn any target $h^*$ in any class $C_i$ with label complexity only $O(\Lambda_i(\epsilon/2, \delta/2, h^*))$, even without knowing which subclass the target belongs to in advance. This can be accomplished by using a simple aggregation algorithm, such as the one given below. Here a set of active learning algorithms (for example, multiple instances of Dasgupta’s splitting algorithm [Dasgupta, 2005] or CAL) are run on individual subclasses $C_i$ in parallel. The output of one of these algorithms is selected according to a sequence of comparisons.


Using this algorithm, we can show the following label complexity bound. The proof appears in Section 3.8.
Algorithm 4: The Aggregation Procedure. Here it is assumed that $C = \bigcup_{i=1}^{\infty} C_i$, and that for each $i$, $\mathcal{A}_i$ is an algorithm achieving label complexity at most $\Lambda_i(\epsilon, \delta, h)$ for the pair $(C_i, \mathcal{D})$. Both the main aggregation procedure and each algorithm $\mathcal{A}_i$ take a number of labels $t$ and a confidence parameter $\delta$ as parameters.

Let $k$ be the largest integer s.t. $k^2 \lceil 72 \ln(4k/\delta) \rceil \leq t/2$
for $i = 1, \ldots, k$ do
    Let $h_i$ be the output of running $\mathcal{A}_i(\lfloor t/(4i^2) \rfloor, \delta/2)$ on the sequence $\{x_{2n-1}\}_{n=1}^{\infty}$
end for
for $i, j \in \{1, 2, \ldots, k\}$ do
    if $\mathbb{P}_{\mathcal{D}}(h_i(x) \neq h_j(x)) > 0$ then
        Let $R_{ij}$ be the first $\lceil 72 \ln(4k/\delta) \rceil$ elements $x$ in the sequence $\{x_{2n}\}_{n=1}^{\infty}$ s.t. $h_i(x) \neq h_j(x)$
        Request the labels of all examples in $R_{ij}$
        Let $m_{ij}$ be the number of elements in $R_{ij}$ on which $h_i$ makes a mistake
    else
        Let $m_{ij} = 0$
    end if
end for
Return $\hat{h}_t = h_i$ where $i = \arg\min_{i \in \{1, \ldots, k\}} \max_{j \in \{1, \ldots, k\}} m_{ij}$
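The aggregation procedure can be sketched in Python as follows. This is a minimal illustration under assumptions that are not part of the original: the infinite unlabeled sequence is replaced by a finite `pool`, the test $\mathbb{P}_{\mathcal{D}}(h_i(x) \neq h_j(x)) > 0$ is approximated by whether any disagreement is observed among the even-indexed pool points, and the base-learner signature `(examples, oracle, budget, delta)` is hypothetical. If fewer than $\lceil 72 \ln(4k/\delta) \rceil$ disagreement points exist in the pool, the sketch simply uses those it finds.

```python
import math
from typing import Callable, List, Sequence

Classifier = Callable[[float], int]
Oracle = Callable[[float], int]
# Hypothetical base-learner signature: (unlabeled examples, oracle, label budget, delta).
Learner = Callable[[Sequence[float], Oracle, int, float], Classifier]


def largest_k(t: int, delta: float) -> int:
    """Largest k >= 0 with k^2 * ceil(72 ln(4k/delta)) <= t/2."""
    k = 0
    while (k + 1) ** 2 * math.ceil(72 * math.log(4 * (k + 1) / delta)) <= t / 2:
        k += 1
    return k


def aggregate(learners: List[Learner], pool: Sequence[float], oracle: Oracle,
              t: int, delta: float) -> Classifier:
    k = min(largest_k(t, delta), len(learners))
    if k == 0:
        raise ValueError("label budget t is too small to run any base learner")

    odd = pool[0::2]   # x_1, x_3, ...: fed to the base learners
    even = pool[1::2]  # x_2, x_4, ...: reserved for pairwise comparisons

    # Learner i (1-indexed) gets floor(t / (4 i^2)) labels and confidence delta/2.
    hs = [learners[i - 1](odd, oracle, t // (4 * i * i), delta / 2)
          for i in range(1, k + 1)]

    m = math.ceil(72 * math.log(4 * k / delta))
    worst = [0] * k  # worst[i] = max_j m_ij
    for i in range(k):
        for j in range(k):
            # First m even-indexed points on which h_i and h_j disagree
            # (empty when i == j or no disagreement is observed, so m_ij = 0).
            r_ij = [x for x in even if hs[i](x) != hs[j](x)][:m]
            mistakes = sum(1 for x in r_ij if hs[i](x) != oracle(x))
            worst[i] = max(worst[i], mistakes)

    best = min(range(k), key=lambda i: worst[i])
    return hs[best]
```

The comparison phase uses at most $k^2 \lceil 72 \ln(4k/\delta) \rceil \leq t/2$ labels, and the base learners use at most $\sum_i t/(4i^2) < t/2$, so the total stays within the budget $t$.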


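Theorem 3.7. For any distribution $\mathcal{D}$, let $C_1, C_2, \ldots$ be a sequence of classes such that for each $i$, the pair $(C_i, \mathcal{D})$ has label complexity at most $\Lambda_i(\epsilon, \delta, h)$ for all $h \in C_i$. Let $C = \bigcup_{i=1}^{\infty} C_i$. Then $(C, \mathcal{D})$ has a label complexity at most

$$\min_{i : h \in C_i} \max\left\{ 4i^2 \lceil \Lambda_i(\epsilon/2, \delta/2, h) \rceil,\ 2i^2 \lceil 72 \ln(4i/\delta) \rceil \right\}$$

for any $h \in C$. In particular, Algorithm 4 achieves this when given as input the algorithms $\mathcal{A}_i$ that each achieve label complexity $\Lambda_i(\epsilon, \delta, h)$ on class $(C_i, \mathcal{D})$.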

A particularly interesting implication of Theorem 3.7 is that the ability to decompose $C$ into a sequence of classes $C_i$ with each pair $(C_i, \mathcal{D})$ learnable at an exponential rate is enough to imply that $(C, \mathcal{D})$ is also learnable at an exponential rate. Since the verifiable label complexity of active learning has received more attention and is therefore better understood, it is often useful to apply this result when there exist known bounds on the verifiable label complexity; the approach loses nothing in generality, as suggested by the following theorem. The proof of this theorem is included in Section 3.9.

Theorem 3.8. For any $(C, \mathcal{D})$ learnable at an exponential rate, there exists a sequence $C_1, C_2, \ldots$ with $C = \bigcup_{i=1}^{\infty} C_i$, and a sequence of active learning algorithms $\mathcal{A}_1, \mathcal{A}_2, \ldots$ such that the algorithm $\mathcal{A}_i$ achieves verifiable label complexity at most $\gamma_i\, \mathrm{polylog}_i(1/(\epsilon\delta))$ for the pair $(C_i, \mathcal{D})$, where $\gamma_i$ is a constant independent of $\epsilon$ and $\delta$. In particular, the aggregation algorithm (Algorithm 4) achieves exponential rates when used with these algorithms.

Note that decomposing a given $C$ into a sequence of subsets $C_i$ that have good verifiable label complexities is not always a simple task. One might be tempted to think a simple decomposition based on increasing values of verifiable label complexity with respect to $(C, \mathcal{D})$ would be sufficient. However, this is not always the case, and generally we need to use information more detailed than verifiable complexity with respect to $(C, \mathcal{D})$ to construct a good decomposition. We have included in Section 3.10 a simple heuristic approach that can be quite effective, and in particular yields good label complexities for every $(C, \mathcal{D})$ described in Section 3.5.

Since  it  is  more  abstract  and  allows  us  to  use  known  active  learning  algorithms  as  a  black 
box,  we  frequently  rely  on  the  decompositional  view  introduced  here  throughout  the  remainder 
of  the  chapter. 

3.5  Exponential  Rates 

The results in Section 3.3 tell us that the label complexity of active learning can be made strictly superior to any passive learning label complexity when the VC dimension is finite. We now ask how much better that label complexity can be. In particular, we describe a number of concept classes and distributions that are learnable at an exponential rate, many of which are known to require $\Omega(1/\epsilon)$ verifiable label complexity.

3.5.1  Exponential  rates  for  simple  classes 

We begin with a few simple observations, to point out situations in which exponential rates are trivially achievable; in fact, in each of the cases mentioned in this subsection, the label complexity is actually $O(1)$.

Clearly if $|\mathcal{X}| < \infty$ or $|C| < \infty$, we can always achieve exponential rates. In the former case, we may simply request the label of every $x$ in the support of $\mathcal{D}$, and thereby perfectly identify the target. The corresponding $\gamma = |\mathcal{X}|$. In the latter case, Algorithm 0 can achieve exponential learning with $\gamma = |C|$ since each queried label will reduce the size of the version space by at least one.

Less obvious is the fact that a similar argument can be applied to any countably infinite hypothesis class $C$. In this case we can impose an ordering $h_1, h_2, \ldots$ over the classifiers in $C$, and set $C_i = \{h_i\}$ for all $i$. By Theorem 3.7, applying the aggregation procedure to this sequence yields an algorithm with label complexity $\Lambda(\epsilon, \delta, h_i) = 2i^2 \lceil 72 \ln(4i/\delta) \rceil = O(1)$.
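To see numerically why this bound is constant in $\epsilon$, the following Python sketch evaluates the Theorem 3.7 expression $\max\{4i^2 \lceil \Lambda_i(\epsilon/2, \delta/2, h) \rceil, 2i^2 \lceil 72 \ln(4i/\delta) \rceil\}$, minimized over the subclasses containing the target. The function name `theorem_3_7_bound` and the representation of each $\Lambda_i$ as a Python callable of $(\epsilon, \delta)$ for a fixed target are assumptions made for illustration.

```python
import math
from typing import Callable, Iterable, Tuple

# (eps, delta) -> label complexity of subclass i for a fixed target h.
LabelComplexity = Callable[[float, float], float]


def theorem_3_7_bound(candidates: Iterable[Tuple[int, LabelComplexity]],
                      eps: float, delta: float) -> int:
    """Evaluate the aggregation bound over the given (i, Lambda_i) pairs,
    which should be exactly the subclasses C_i that contain the target."""
    return min(
        max(4 * i * i * math.ceil(lam(eps / 2, delta / 2)),
            2 * i * i * math.ceil(72 * math.log(4 * i / delta)))
        for i, lam in candidates
    )


# Singleton subclasses need no labels, so Lambda_i is identically 0.
zero = lambda eps, delta: 0
```

With `zero` as the subclass complexity, `theorem_3_7_bound([(i, zero)], eps, delta)` depends on $i$ and $\delta$ but not on `eps`, which is the sense in which the label complexity is $O(1)$ for each fixed target $h_i$.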

3.5.2  Geometric  Concepts,  Uniform  Distribution 

Many interesting geometric concepts in $\mathbb{R}^n$ are learnable at an exponential rate if the underlying distribution is uniform on some subset of $\mathbb{R}^n$. Here we provide some examples; interestingly, every example in this subsection has some targets for which the verifiable label complexity is $\Omega(1/\epsilon)$. As we see in Section 3.5.3, all of the results in this section can be extended to many other types of distributions as well.

Unions of k intervals under arbitrary distributions: Let $\mathcal{X}$ be the interval $[0, 1)$ and let $C^{(k)}$ denote the class of unions of at most $k$ intervals. In other words, $C^{(k)}$ contains functions described by a sequence $\langle a_0, a_1, \ldots, a_\ell \rangle$, where $a_0 = 0$, $a_\ell = 1$, $\ell \leq 2k + 1$, and $a_0, \ldots, a_\ell$ is the (nondecreasing) sequence of transition points between negative and positive segments (so $x$ is labeled $+1$ iff $x \in [a_i, a_{i+1})$ for some odd $i$). For any distribution, this class is learnable at an exponential rate by the following decomposition argument. First, define $C_1$ to be the set containing the all-negative function along with any functions that are equivalent given the distribution $\mathcal{D}$. Formally,

$$C_1 = \{h \in C^{(k)} : \mathbb{P}(h(X) = +1) = 0\}.$$

Clearly $C_1$ has verifiable label complexity 0. For $i = 2, 3, \ldots, k + 1$, let $C_i$ be the set containing all functions that can be represented as unions of $i - 1$ intervals but cannot be represented as unions of fewer intervals. More formally, we can inductively define each $C_i$ as

$$C_i = \{h \in C^{(k)} : \exists h' \in C^{(i-1)} \text{ s.t. } \mathbb{P}(h(X) \neq h'(X)) = 0\} \setminus \bigcup_{j < i} C_j.$$

For $i > 1$, within each subclass $C_i$, for each $h \in C_i$ the disagreement coefficient with respect to $C_i$ is bounded by something proportional to $k + 1/w(h)$, where $w(h)$ is the weight of the smallest positive or negative interval with nonzero weight. Thus running Algorithm 0 with $C_i$ achieves polylogarithmic (verifiable) label complexity for any $h \in C_i$. Since $C^{(k)} = \bigcup_{i=1}^{k+1} C_i$, by Theorem 3.7, $C^{(k)}$ is learnable at an exponential rate.
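The decomposition index can be computed directly from the transition points in one special case. The Python sketch below assumes the uniform distribution on $[0, 1)$, so that a segment has zero probability exactly when it has zero width; for other distributions the zero-mass test would have to use $\mathcal{D}$ itself. The function name `subclass_index` is introduced here for illustration.

```python
from typing import Sequence


def subclass_index(a: Sequence[float]) -> int:
    """Index i of the subclass C_i containing the union of intervals with
    nondecreasing transition points a[0] = 0, ..., a[-1] = 1, where the
    segment [a[j], a[j+1]) is positive iff j is odd.

    Assumes the uniform distribution: zero-width segments are invisible, and
    positive segments separated only by a zero-width gap count as one interval.
    """
    runs = 0               # number of maximal positive runs of nonzero width
    prev_positive = False  # label of the last nonzero-width segment seen
    for j in range(len(a) - 1):
        if a[j + 1] <= a[j]:
            continue  # zero-width segment carries no probability mass
        positive = (j % 2 == 1)
        if positive and not prev_positive:
            runs += 1
        prev_positive = positive
    # C_1 is the all-negative class; C_i needs exactly i - 1 intervals.
    return runs + 1
```

For example, `[0, 0.2, 0.5, 0.5, 0.8, 1]` describes two touching positive segments, which the sketch treats as a single interval, placing the function in $C_2$.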


Ordinary Binary Classification Trees: Let $\mathcal{X}$ be the cube $[0, 1]^n$, $\mathcal{D}$ be the uniform distribution on $\mathcal{X}$, and $C$ be the class of binary decision trees using a finite number of axis-parallel splits (see e.g., Devroye et al. [1996], Chapter 20). In this case, in the same spirit as the previous example, we let $C_i$ be the set of decision trees in $C$ distance zero from a tree with $i$ leaf nodes, not contained in any $C_j$ for $j < i$. For any $i$, the disagreement coefficient for any $h \in C_i$ (with respect to $(C_i, \mathcal{D})$) is a finite constant, and we can choose $C_i$ to have finite VC dimension, so each $(C_i, \mathcal{D})$ is learnable at an exponential rate (by running Algorithm 0 with $C_i$). By Theorem 3.7, $(C, \mathcal{D})$ is learnable at an exponential rate.
Linear  Separators


Theorem 3.9. Let $C$ be the concept class of linear separators in $n$ dimensions, and let $\mathcal{D}$ be the uniform distribution over the surface of the unit sphere. The pair $(C, \mathcal{D})$ is learnable at an exponential rate.

Proof. There are multiple ways to achieve this. We describe here a simple proof that uses a decomposition as follows. Let $\lambda(h)$ be the probability mass of the minority class under hypothesis $h$. Let $C_1$ be the set containing only the separators $h$ with $\lambda(h) = 0$, let $C_2 = \{h \in C : \lambda(h) = 1/2\}$, and let $C_3 = C \setminus (C_1 \cup C_2)$. As before, we can use a black box active learning algorithm such as CAL to learn within the class $C_3$. To prove that we indeed get the desired exponential rate of active learning, we show that the disagreement coefficient of any separator $h \in C_3$ with respect to $(C_3, \mathcal{D})$ is finite. The results concerning Algorithm 0 from Chapter 2 then immediately imply that $C_3$ is learnable at an exponential rate. Since $C_1$ trivially has label complexity 1, and $(C_2, \mathcal{D})$ is known to be learnable at an exponential rate [e.g., Balcan, Broder, and Zhang, 2007, Dasgupta, 2005, Dasgupta, Kalai, and Monteleoni, 2005, Hanneke, 2007b], combined with Theorem 3.7, this would imply the result.


Below, we will restrict the discussion to hypotheses in $C_3$, which will be implicit in notation such as $B(h, r)$, etc. First note that, to show $\theta_h < \infty$, it suffices to show that

$$\lim_{r \to 0} \frac{\mathbb{P}(DIS(B(h, r)))}{r} < \infty, \qquad (3.1)$$

so we will focus on this.

For any $h$, there exists $r_h > 0$ s.t. $\forall h' \in B(h, r_h)$, $\mathbb{P}(h'(X) = +1) \leq 1/2 \Leftrightarrow \mathbb{P}(h(X) = +1) \leq 1/2$, or in other words the minority class is the same among all $h' \in B(h, r_h)$. Now consider any $h' \in B(h, r)$ for $0 < r < \min\{r_h, \lambda(h)/2\}$. Clearly $\mathbb{P}(h(X) \neq h'(X)) \geq |\lambda(h) - \lambda(h')|$. Suppose $h(x) = \mathrm{sign}(w \cdot x + b)$ and $h'(x) = \mathrm{sign}(w' \cdot x + b')$ (where, without loss, we assume $\|w\| = 1$), and $\alpha(h, h') \in [0, \pi]$ is the angle between $w$ and $w'$. If $\alpha(h, h') = 0$ or if the minority regions of $h$ and $h'$ do not intersect, then clearly $\mathbb{P}(h(X) \neq h'(X)) \geq \frac{2\alpha(h, h')}{\pi} \min\{\lambda(h), \lambda(h')\}$. Otherwise, consider the classifiers $\bar{h}(x) = \mathrm{sign}(w \cdot x + \bar{b})$ and $\bar{h}'(x) = \mathrm{sign}(w' \cdot x + \bar{b}')$, where $\bar{b}$ and $\bar{b}'$ are chosen s.t. $\mathbb{P}(\bar{h}(X) = +1) = \mathbb{P}(\bar{h}'(X) = +1)$ and $\lambda(\bar{h}) = \min\{\lambda(h), \lambda(h')\}$. That is, $\bar{h}$ and $\bar{h}'$ are identical to $h$ and $h'$ except that we adjust the bias term of the one with larger minority class probability to reduce its minority class probability to be equal to the other’s. If $h \neq \bar{h}$, then most of the probability mass of $\{x : h(x) \neq \bar{h}(x)\}$ is contained in the majority class region of $h'$ (or vice versa if $h' \neq \bar{h}'$), and in fact every point in $\{x : h(x) \neq \bar{h}(x)\}$ is labeled by $\bar{h}$ according to the majority class label (and similarly for $h'$ and $\bar{h}'$). Therefore, we must have $\mathbb{P}(h(X) \neq h'(X)) \geq \mathbb{P}(\bar{h}(X) \neq \bar{h}'(X))$.

Figure 3.2: Projection of $h$ and $h'$ into the plane defined by $w$ and $w'$.

We also have that $\mathbb{P}(\bar{h}(X) \neq \bar{h}'(X)) \geq \frac{2\alpha(h, h')}{\pi} \lambda(\bar{h})$. To see this, consider the projection onto the 2-dimensional plane defined by $w$ and $w'$, as in Figure 3.2. Because the two decision boundaries must intersect inside the acute angle, the probability mass contained in each of the two wedges (both with $\alpha(h, h')$ angle) making up the projected region of disagreement between $\bar{h}$ and $\bar{h}'$ must be at least an $\alpha(h, h')/\pi$ fraction of the total minority class probability for the respective classifier, implying the union of these two wedges has probability mass at least $\frac{2\alpha(h, h')}{\pi} \lambda(\bar{h})$. Thus, we have $\mathbb{P}(h(X) \neq h'(X)) \geq \max\left\{|\lambda(h) - \lambda(h')|, \frac{2\alpha(h, h')}{\pi} \min\{\lambda(h), \lambda(h')\}\right\}$. In particular,

$$B(h, r) \subseteq \left\{h' : \max\left\{|\lambda(h) - \lambda(h')|, \frac{2\alpha(h, h')}{\pi} \min\{\lambda(h), \lambda(h')\}\right\} \leq r\right\}.$$

The region of disagreement of this set is at most

$$DIS\left(\left\{h' : \frac{2\alpha(h, h')}{\pi}(\lambda(h) - r) \leq r \wedge |\lambda(h) - \lambda(h')| \leq r\right\}\right)$$
$$\subseteq DIS(\{h' : w' = w \wedge |\lambda(h') - \lambda(h)| \leq r\}) \cup DIS(\{h' : \alpha(h, h') \leq \pi r/\lambda(h) \wedge |\lambda(h) - \lambda(h')| = r\}),$$
where this last line follows from the following reasoning. Take $y_{maj}$ to be the majority class of $h$ (arbitrary if $\lambda(h) = 1/2$). For any $h'$ with $|\lambda(h) - \lambda(h')| \leq r$, the $h''$ with $\alpha(h, h'') = \alpha(h, h')$ having $\mathbb{P}(h(X) = y_{maj}) - \mathbb{P}(h''(X) = y_{maj}) = r$ disagrees with $h$ on a set of points containing $\{x : h'(x) \neq h(x) = y_{maj}\}$; likewise, the one having $\mathbb{P}(h(X) = y_{maj}) - \mathbb{P}(h''(X) = y_{maj}) = -r$ disagrees with $h$ on a set of points containing $\{x : h'(x) \neq h(x) = -y_{maj}\}$. So any point in disagreement between $h$ and some $h'$ with $|\lambda(h) - \lambda(h')| \leq r$ and $\alpha(h, h') \leq \pi r/\lambda(h)$ is also disagreed upon by some $h''$ with $|\lambda(h) - \lambda(h'')| = r$ and $\alpha(h, h'') \leq \pi r/\lambda(h)$.

Some simple trigonometry shows that $DIS(\{h' : \alpha(h, h') \leq \pi r/\lambda(h) \wedge |\lambda(h) - \lambda(h')| = r\})$ is contained in the set of points within distance $\sin(\pi r/\lambda(h)) \leq \pi r/\lambda(h)$ of the two hyperplanes representing $h_1(x) = \mathrm{sign}(w \cdot x + b_1)$ and $h_2(x) = \mathrm{sign}(w \cdot x + b_2)$ defined by the property that $\lambda(h_1) - \lambda(h) = \lambda(h) - \lambda(h_2) = r$, so that the total region of disagreement is contained within

$$\{x : h_1(x) \neq h_2(x)\} \cup \{x : \min\{|w \cdot x + b_1|, |w \cdot x + b_2|\} \leq \pi r/\lambda(h)\}.$$

Clearly, $\mathbb{P}(\{x : h_1(x) \neq h_2(x)\}) = 2r$. Using previous results [Balcan et al., 2006, Hanneke, 2007b], we know that $\mathbb{P}(\{x : \min\{|w \cdot x + b_1|, |w \cdot x + b_2|\} \leq \pi r/\lambda(h)\}) \leq 2\pi\sqrt{n}\, r/\lambda(h)$ (since the probability mass contained within this distance of a hyperplane is maximized when the hyperplane passes through the origin). Thus, the probability of the entire region of disagreement is at most $(2 + 2\pi\sqrt{n}/\lambda(h)) r$, so that (3.1) holds, and therefore the disagreement coefficient is finite. □


3.5.3  Composition  results 

We  can  also  extend  the  results  from  the  previous  subsection  to  other  types  of  distributions  and 
concept  classes  in  a  variety  of  ways.  Here  we  include  a  few  results  to  this  end. 

Close distributions: If $(C, \mathcal{D})$ is learnable at an exponential rate, then for any distribution $\mathcal{D}'$ such that for all measurable $A \subseteq \mathcal{X}$, $\lambda \mathbb{P}_{\mathcal{D}}(A) \leq \mathbb{P}_{\mathcal{D}'}(A) \leq (1/\lambda) \mathbb{P}_{\mathcal{D}}(A)$ for some $\lambda \in (0, 1]$, $(C, \mathcal{D}')$ is also learnable at an exponential rate. In particular, we can simply use the algorithm for $(C, \mathcal{D})$, filter the examples from $\mathcal{D}'$ so that they appear like examples from $\mathcal{D}$, and then any $t$ large enough to find an $\epsilon\lambda$-good classifier with respect to $\mathcal{D}$ is large enough to find an $\epsilon$-good classifier with respect to $\mathcal{D}'$.

Figure 3.3: Illustration of the proof of Theorem 3.10. The dark gray regions represent $B_{\mathcal{D}_1}(h_1, 2\hat{r})$ and $B_{\mathcal{D}_2}(h_2, 2\hat{r})$. The function $\hat{h}$ that gets returned is in the intersection of these. The light gray regions represent $B_{\mathcal{D}_1}(h_1, \epsilon/3)$ and $B_{\mathcal{D}_2}(h_2, \epsilon/3)$. The target function $h^*$ is in the intersection of these. We therefore must have $\hat{r} \leq \epsilon/3$, and by the triangle inequality $er(\hat{h}) \leq \epsilon$.
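One standard way to realize the filtering step for close distributions is rejection sampling, sketched below in Python. This is an illustration rather than the thesis's construction: it assumes both distributions have known densities, here passed as the hypothetical callables `p_target` (for $\mathcal{D}$) and `p_source` (for $\mathcal{D}'$), and it relies on the closeness condition, which implies $\lambda\, p_{\mathcal{D}}(x) \leq p_{\mathcal{D}'}(x)$ almost everywhere, so the acceptance probability never exceeds 1.

```python
import random
from typing import Callable, Iterable, Iterator


def filter_to_target(stream: Iterable[float],
                     p_target: Callable[[float], float],
                     p_source: Callable[[float], float],
                     lam: float,
                     rng: random.Random) -> Iterator[float]:
    """Yield a subsequence of `stream` (drawn i.i.d. from the source density)
    whose elements are distributed according to the target density."""
    for x in stream:
        # Accept x with probability lam * p_target(x) / p_source(x) <= 1.
        accept_prob = lam * p_target(x) / p_source(x)
        if rng.random() < accept_prob:
            yield x
```

Each source example is kept with overall probability $\lambda$, so the filter only slows the arrival of unlabeled examples; it does not spend labels.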

Mixtures of distributions: Suppose there exist algorithms $\mathcal{A}_1$ and $\mathcal{A}_2$ for learning a class $C$ at an exponential rate under distributions $\mathcal{D}_1$ and $\mathcal{D}_2$ respectively. It turns out we can also learn under any mixture of $\mathcal{D}_1$ and $\mathcal{D}_2$ at an exponential rate, by using $\mathcal{A}_1$ and $\mathcal{A}_2$ as black boxes. In particular, the following theorem relates the label complexity under a mixture to the label complexities under the mixing components.

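Theorem 3.10. Let $C$ be an arbitrary hypothesis class. Assume that the pairs $(C, \mathcal{D}_1)$ and $(C, \mathcal{D}_2)$ have label complexities $\Lambda_1(\epsilon, \delta, h^*)$ and $\Lambda_2(\epsilon, \delta, h^*)$ respectively, where $\mathcal{D}_1$ and $\mathcal{D}_2$ have density functions $\Pr_{\mathcal{D}_1}$ and $\Pr_{\mathcal{D}_2}$ respectively. Then for any $\alpha \in [0, 1]$, the pair $(C, \alpha\mathcal{D}_1 + (1 - \alpha)\mathcal{D}_2)$ has label complexity at most $2 \lceil \max\{\Lambda_1(\epsilon/3, \delta/2, h^*), \Lambda_2(\epsilon/3, \delta/2, h^*)\} \rceil$.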

Proof. If $\alpha = 0$ or $1$ then the theorem statement holds trivially. Assume instead that $\alpha \in (0, 1)$. We describe an algorithm in terms of $\alpha$, $\mathcal{D}_1$, and $\mathcal{D}_2$, which achieves this label complexity bound. Suppose algorithms $\mathcal{A}_1$ and $\mathcal{A}_2$ achieve the stated label complexities under $\mathcal{D}_1$ and $\mathcal{D}_2$ respectively. At a high level, the algorithm we define works by “filtering” the distribution over input so that it appears to come from two streams, one distributed according to $\mathcal{D}_1$, and one distributed according to $\mathcal{D}_2$, and feeding these filtered streams to $\mathcal{A}_1$ and $\mathcal{A}_2$ respectively. To do so, we define a random sequence $u_1, u_2, \ldots$ of independent uniform random variables in $[0, 1]$. We then run $\mathcal{A}_1$ on the sequence of examples $x_i$ from the unlabeled data sequence satisfying

$$u_i < \frac{\alpha \Pr_{\mathcal{D}_1}(x_i)}{\alpha \Pr_{\mathcal{D}_1}(x_i) + (1 - \alpha) \Pr_{\mathcal{D}_2}(x_i)},$$

and run $\mathcal{A}_2$ on the remaining examples, allowing each to make an equal number of label requests.
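The routing rule just described can be sketched in Python. The sketch assumes the two component densities are available as callables (`p1` and `p2`, names introduced here for illustration) and that the unlabeled examples are drawn from the mixture; it only performs the split and leaves the base learners out.

```python
import random
from typing import Callable, Iterable, List, Tuple


def split_mixture_stream(stream: Iterable[float], alpha: float,
                         p1: Callable[[float], float],
                         p2: Callable[[float], float],
                         rng: random.Random) -> Tuple[List[float], List[float]]:
    """Route each mixture example to the first learner with probability equal
    to the posterior weight of the first component, else to the second."""
    to_first: List[float] = []
    to_second: List[float] = []
    for x in stream:
        w1 = alpha * p1(x)
        w2 = (1 - alpha) * p2(x)
        u = rng.random()  # the independent uniform variable u_i
        if u < w1 / (w1 + w2):
            to_first.append(x)
        else:
            to_second.append(x)
    return to_first, to_second
```

Conditioned on being routed to the first learner, an example has density proportional to the mixture density times the posterior weight, which is the first component's density; the same holds for the second stream.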

Let $h_1$ and $h_2$ be the classifiers output by $\mathcal{A}_1$ and $\mathcal{A}_2$. Because of the filtering, the examples that $\mathcal{A}_1$ sees are distributed according to $\mathcal{D}_1$, so after $t/2$ queries, the current error of $h_1$ with respect to $\mathcal{D}_1$ is, with probability $1 - \delta/2$, at most $\inf\{\epsilon' : \Lambda_1(\epsilon', \delta/2, h^*) \leq t/2\}$. A similar argument applies to the error of $h_2$ with respect to $\mathcal{D}_2$.

Finally, let

$$\hat{r} = \inf\{r : B_{\mathcal{D}_1}(h_1, r) \cap B_{\mathcal{D}_2}(h_2, r) \neq \emptyset\},$$

where

$$B_{\mathcal{D}_i}(h_i, r) = \{h \in C : \Pr_{\mathcal{D}_i}(h(x) \neq h_i(x)) \leq r\}.$$

Define the output of the algorithm to be any $\hat{h} \in B_{\mathcal{D}_1}(h_1, 2\hat{r}) \cap B_{\mathcal{D}_2}(h_2, 2\hat{r})$. If a total of $t \geq 2 \lceil \max\{\Lambda_1(\epsilon/3, \delta/2, h^*), \Lambda_2(\epsilon/3, \delta/2, h^*)\} \rceil$ queries have been made ($t/2$ by $\mathcal{A}_1$ and $t/2$ by $\mathcal{A}_2$), then by a union bound, with probability at least $1 - \delta$, $h^*$ is in the intersection of the $\epsilon/3$-balls, and so $\hat{h}$ is in the intersection of the $2\epsilon/3$-balls. By the triangle inequality, $\hat{h}$ is within $\epsilon$ of $h^*$ under both distributions, and thus also under the mixture. (See Figure 3.3 for an illustration of these ideas.) □
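For reference, the closing triangle-inequality step can be written out as follows. This only restates the argument of the proof, on the event that both base learners succeed; here $\hat{h}$ denotes the returned classifier and $\hat{r}$ the infimum radius defined in the proof (the hat notation is used for readability). Since $h^*$ lies in both $\epsilon/3$-balls, the intersection is nonempty at radius $\epsilon/3$, so $\hat{r} \leq \epsilon/3$.

```latex
\begin{align*}
\Pr_{\mathcal{D}_i}\big(\hat{h}(x) \neq h^*(x)\big)
  &\leq \Pr_{\mathcal{D}_i}\big(\hat{h}(x) \neq h_i(x)\big)
      + \Pr_{\mathcal{D}_i}\big(h_i(x) \neq h^*(x)\big) \\
  &\leq 2\hat{r} + \epsilon/3 \;\leq\; \epsilon,
      \qquad i \in \{1, 2\}, \\
\Pr_{\alpha\mathcal{D}_1 + (1-\alpha)\mathcal{D}_2}\big(\hat{h}(x) \neq h^*(x)\big)
  &= \alpha \Pr_{\mathcal{D}_1}\big(\hat{h}(x) \neq h^*(x)\big)
   + (1-\alpha) \Pr_{\mathcal{D}_2}\big(\hat{h}(x) \neq h^*(x)\big)
  \;\leq\; \epsilon.
\end{align*}
```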

3.5.4  Lower  Bounds 

Given the previous discussion, one might suspect that any pair $(C, \mathcal{D})$ is learnable at an exponential rate, under some mild condition such as finite VC dimension. However, we show in the following that this is not the case, even for some simple geometric concept classes when the distribution is especially nasty.

Figure 3.4: A learning problem where exponential rates are not achievable. The instance space is an infinite-depth tree. The target labels nodes along a single infinite path as $+1$, and labels all other nodes $-1$. For any $\phi(\epsilon) = o(1/\epsilon)$, when the number of children and probability mass of each node at each subsequent level are set in a certain way, label complexities of $o(\phi(\epsilon))$ are not achievable for all targets.

Theorem 3.11. For any positive function $\phi(\epsilon) = o(1/\epsilon)$, there exists a pair $(C, \mathcal{D})$, with the VC dimension of $C$ equal 1, such that for any achievable label complexity $\Lambda(\epsilon, \delta, h)$ for $(C, \mathcal{D})$, for any $\delta \in (0, 1/4)$,

$$\exists h \in C \text{ s.t. } \Lambda(\epsilon, \delta, h) \neq o(\phi(\epsilon)).$$

In particular, taking $\phi(\epsilon) = 1/\sqrt{\epsilon}$ (for example), this implies that there exists a $(C, \mathcal{D})$ that is not learnable at an exponential rate (in the sense of Definition 3.3).

Proof. If we can prove this for any such $\phi(\epsilon) \neq O(1)$, then clearly this would imply the result holds for $\phi(\epsilon) = O(1)$ as well, so we will focus on the $\phi(\epsilon) \neq O(1)$ case. Let $T$ be a fixed infinite tree in which each node at depth $i$ has $c_i$ children; $c_i$ is defined shortly below. We consider learning the hypothesis class $C$ where each $h \in C$ corresponds to a path down the tree starting at the root; every node along this path is labeled 1 while the remaining nodes are labeled $-1$. Clearly for each $h \in C$ there is precisely one node on each level of the tree labeled 1 by $h$ (i.e., one node at each depth). $C$ has VC dimension 1 since knowing the identity of the node labeled 1 on level $i$ is enough to determine the labels of all nodes on levels $0, \ldots, i$ perfectly. This learning problem is depicted in Figure 3.4.
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Now  we  define  T>,  a  “bad”  distribution  for  C.  Let  1  be  any  sequence  of  positive  numbers 
s.t.  =  1-  wiU  bound  the  total  probability  of  all  nodes  on  level  i  according  to  V. 

Assume  all  nodes  on  level  i  have  the  same  probability  according  to  V.  and  call  this  p, .  We  define 
the  values  of  pt  and  ct  recursively  as  follows.  For  each  i  >  1,  we  define  p,  as  any  positive  number 
s.t.  Pi\(p{pi)]  nj=o °j  <  z  i  and  o{pi)  >  4,  and  define  q_i  =  \4>{pi)].  We  are  guaranteed  that 
such  a  value  of  Pi  exists  by  the  assumptions  that  0(e)  =  o(l/e),  meaning  lime^0  e0(e)  =  0,  and 
that  0(e)  7^  0(1).  Letting  p0  =  1  —  X^>i  Pi  FI  ;=o  cj  completes  the  definition  of  V. 
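As a rough illustration of the recursive definition above, the following sketch computes $p_i$ and $c_{i-1}$ for the first few levels. The choices $\phi(\epsilon) = 1/\sqrt{\epsilon}$ and $\ell_i = 2^{-i}$, the halving search, and the function name `build_tree_parameters` are illustrative assumptions rather than part of the construction; any positive $p_i$ satisfying the two stated conditions would serve equally well.

```python
import math

def build_tree_parameters(phi, ell, levels):
    """Sketch of the recursive choice of p_i and c_{i-1}.

    phi    : rate function phi(eps), assumed positive with phi(eps) = o(1/eps)
    ell    : function i -> ell_i, the probability budget allowed for level i
    levels : number of levels (i = 1..levels) to construct
    """
    p, c = {}, {}
    nodes_above = 1  # product of c_0 .. c_{i-2}: number of nodes on level i-1
    for i in range(1, levels + 1):
        candidate = 1.0
        # Halve the candidate until both conditions hold:
        #   phi(p_i) >= 4  and  p_i * floor(phi(p_i)) * prod_{j<=i-2} c_j <= ell_i
        while not (phi(candidate) >= 4
                   and candidate * math.floor(phi(candidate)) * nodes_above <= ell(i)):
            candidate /= 2.0
        p[i] = candidate
        c[i - 1] = math.floor(phi(candidate))
        nodes_above *= c[i - 1]
    return p, c

# Assumed example choices: phi(eps) = 1/sqrt(eps) and ell_i = 2^{-i}.
p, c = build_tree_parameters(lambda e: 1.0 / math.sqrt(e), lambda i: 2.0 ** -i, 3)
```

Under these assumed choices the branching factors grow quickly from level to level, which is what forces a learner with a small label budget to miss most subtrees.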


With this definition of the parameters above, since $\sum_i p_i \leq 1$, we know that for any $\epsilon_0 > 0$, there exists some $\epsilon < \epsilon_0$ such that for some level $j$, $p_j = \epsilon$ and thus $c_{j-1} \geq \lfloor \phi(p_j) \rfloor = \lfloor \phi(\epsilon) \rfloor$. We will use this fact to show that $\propto \phi(\epsilon)$ labels are needed to learn with error less than $\epsilon$ for these values of $\epsilon$. To complete the proof, we must prove the existence of a "difficult" target function, customized to challenge the particular learning algorithm being used. To accomplish this, we will use the probabilistic method to prove the existence of a point in each level $i$ such that any target function labeling that point positive would have a label complexity $\geq \phi(p_i)/4$. The difficult target function simply strings these points together.


To begin, we define $x_0$ = the root node. Then for each $i \geq 1$, recursively define $x_i$ as follows. Suppose, for any $h$, the set $R_h$ and the classifier $\hat{h}_h$ are, respectively, the random variable representing the set of examples the learning algorithm would request, and the classifier the learning algorithm would output, when $h$ is the target and its label request budget is set to $t = \lfloor \phi(p_i)/2 \rfloor$. For any node $x$, we will let $\mathrm{Children}(x)$ denote the set of children of $x$, and $\mathrm{Subtree}(x)$ denote the set of $x$ along with all descendants of $x$. Additionally, let $h_x$ denote any classifier in $C$ s.t. $h_x(x) = +1$. Now note that


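\begin{align*}
& \max_{x \in \mathrm{Children}(x_{i-1})} \; \inf_{h \in C : h(x) = +1} \mathbb{P}\left\{ \mathbb{P}_{\mathcal{D}}(h(X) \neq \hat{h}_h(X)) > p_i \right\} \\
& \geq \frac{1}{c_{i-1}} \sum_{x \in \mathrm{Children}(x_{i-1})} \; \inf_{h \in C : h(x) = +1} \mathbb{P}\left\{ \mathbb{P}_{\mathcal{D}}(h(X) \neq \hat{h}_h(X)) > p_i \right\} \\
& \geq \frac{1}{c_{i-1}} \sum_{x \in \mathrm{Children}(x_{i-1})} \mathbb{P}\left\{ \forall h \in C : h(x) = +1, \; \mathrm{Subtree}(x) \cap R_h = \emptyset \wedge \mathbb{P}_{\mathcal{D}}(h(X) \neq \hat{h}_h(X)) > p_i \right\} \\
& = \mathbb{E}\left[ \frac{1}{c_{i-1}} \sum_{x \in \mathrm{Children}(x_{i-1}) : \mathrm{Subtree}(x) \cap R_{h_x} = \emptyset} \mathbb{1}\left[ \forall h \in C : h(x) = +1, \; \mathbb{P}_{\mathcal{D}}(h(X) \neq \hat{h}_h(X)) > p_i \right] \right] \\
& \geq \mathbb{E}\left[ \min_{x' \in \mathrm{Children}(x_{i-1})} \frac{1}{c_{i-1}} \sum_{x \in \mathrm{Children}(x_{i-1}) : \mathrm{Subtree}(x) \cap R_{h_x} = \emptyset} \mathbb{1}\left[ x' \neq x \right] \right] \\
& \geq \frac{1}{c_{i-1}} \left( c_{i-1} - t - 1 \right) = \frac{1}{\lfloor \phi(p_i) \rfloor} \left( \lfloor \phi(p_i) \rfloor - \lfloor \phi(p_i)/2 \rfloor - 1 \right) \geq \frac{1}{\lfloor \phi(p_i) \rfloor} \left( \lfloor \phi(p_i) \rfloor / 2 - 1 \right) \geq 1/4.
\end{align*}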


The expectations above are over the unlabeled examples and any internal random bits used by the algorithm. The above inequalities imply there exists some $x \in \mathrm{Children}(x_{i-1})$ such that every $h \in C$ that has $h(x) = +1$ has $\Lambda(p_i, \delta, h) > \lfloor \phi(p_i)/2 \rfloor \geq \phi(p_i)/4$; we will take $x_i$ to be this value of $x$. We now simply take the target function $h^*$ to be the classifier that labels $x_i$ positive for all $i$, and labels every other point negative. By construction, we have $\forall i$, $\Lambda(p_i, \delta, h^*) \geq \phi(p_i)/4$, and therefore

\[ \forall \epsilon_0 > 0, \; \exists \epsilon < \epsilon_0 : \Lambda(\epsilon, \delta, h^*) \geq \phi(\epsilon)/4, \]

so that $\Lambda(\epsilon, \delta, h^*) \neq o(\phi(\epsilon))$.


□ 
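The last step of the chain of inequalities in the proof says that a learner with budget $t = \lfloor \phi(p_i)/2 \rfloor$ fails with probability at least $1/4$ whenever $\phi(p_i) \geq 4$. The sketch below is only a numerical sanity check of that arithmetic over an arbitrary grid of values; the function name `failure_lower_bound` and the grid are assumptions made for illustration.

```python
import math

def failure_lower_bound(phi_value):
    """Evaluate (c - t - 1) / c with c = floor(phi) and t = floor(phi / 2)."""
    children = math.floor(phi_value)      # c_{i-1} = floor(phi(p_i))
    budget = math.floor(phi_value / 2)    # t = floor(phi(p_i) / 2)
    return (children - budget - 1) / children

# Check the bound on an (arbitrary) grid of values phi >= 4.
values = [failure_lower_bound(4 + 0.25 * n) for n in range(400)]
worst_case = min(values)
```

The minimum is attained at $\phi(p_i) = 4$, which is why the construction requires $\phi(p_i) \geq 4$ and the theorem restricts to $\delta < 1/4$.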


Note that this implies that the $o(1/\epsilon)$ guarantee of Corollary 3.6 is in some sense the tightest guarantee we can make at that level of generality, without using a more detailed description of the structure of the problem beyond the finite VC dimension assumption.

This type of example can be realized by certain nasty distributions, even for a variety of simple hypothesis classes: for example, linear separators in $\mathbb{R}^2$ or axis-aligned rectangles in $\mathbb{R}^2$. We remark that this example can also be modified to show that we cannot expect intersections of classifiers to preserve exponential rates. That is, the proof can be extended to show that there exist classes $C_1$ and $C_2$, such that both $(C_1, \mathcal{D})$ and $(C_2, \mathcal{D})$ are learnable at an exponential rate, but $(C, \mathcal{D})$ is not, where $C = \{h_1 \cap h_2 : h_1 \in C_1, h_2 \in C_2\}$.


3.6  Discussion  and  Open  Questions 


The  implication  of  our  analysis  is  that  in  many  interesting  cases  where  it  was  previously  believed 
that  active  learning  could  not  help,  it  turns  out  that  active  learning  does  help  asymptotically. 
We  have  formalized  this  idea  and  illustrated  it  with  a  number  of  examples  and  general  theorems 
throughout  the  chapter.  This  realization  dramatically  shifts  our  understanding  of  the  usefulness 
of  active  learning:  while  previously  it  was  thought  that  active  learning  could  not  provably  help 
in  any  but  a  few  contrived  and  unrealistic  learning  problems,  in  this  alternative  perspective  we 
now  see  that  active  learning  essentially  always  helps,  and  does  so  significantly  in  all  but  a  few 
contrived  and  unrealistic  problems. 

The use of decompositions of $C$ in our analysis generates another interpretation of these results. Specifically, Dasgupta [2005] posed the question of whether it would be useful to develop active learning techniques for looking at unlabeled data and "placing bets" on certain hypotheses. One might interpret this work as an answer to this question; that is, some of the decompositions used in this chapter can be interpreted as reflecting a preference partial-ordering of the hypotheses [Shawe-Taylor et al., 1998, Vapnik, 1998]. However, the construction of a good decomposition in active learning seems more subtle and quite different from previous work in the context of supervised or semi-supervised learning.

It is interesting to examine the role of target- and distribution-dependent constants in this analysis. As defined, both the verifiable and true label complexities may depend heavily on the particular target function and distribution. Thus, in both cases, we have interpreted these quantities as fixed when studying the asymptotic growth of these label complexities as $\epsilon$ approaches 0. It has been known for some time that, with only a few unusual exceptions, any target- and distribution-independent bound on the verifiable label complexity could typically be no better than the label complexity of passive learning; in particular, this observation led Dasgupta to formulate his splitting index bounds as both target- and distribution-dependent [Dasgupta, 2005]. This fact also applies to bounds on the true label complexity. Indeed, the entire distinction between verifiable and true label complexities collapses if we remove the dependence on these unobservable quantities.

One might wonder what the practical implications of the true label complexity of active learning might be, since the theoretical improvements we provide are for an unverifiable complexity measure and therefore do not actually inform the user (or algorithm) of how many labels to allow the algorithm to request. However, there might still be implications for the design of practical algorithms. In some sense, this is the same issue faced in the analysis of universally consistent learning rules in passive learning [Devroye et al., 1996]. There is typically no way to verify how close to the Bayes error rate a classifier is (verifiable complexity is infinite), yet we still want learning rules whose error rates provably converge to the Bayes error in the limit (true complexity is a finite function of $\epsilon$ and the distribution of $(X, Y)$), and we often find such methods quite effective in practice (e.g., $k$-nearest neighbor methods). So this is one instance where an unverifiable label complexity seems to be a useful guide in algorithm design. In active learning with finite-complexity hypothesis classes we are more fortunate, since the verifiable complexity is finite, and we certainly want algorithms with small verifiable label complexity; however, an analysis of unverifiable complexities still seems relevant, particularly when the verifiable complexity is large. In general, it seems desirable to design algorithms for any given active learning problem that achieve both a verifiable label complexity that is near optimal and a true label complexity that is asymptotically better than passive learning.


Open Questions: There are many interesting open problems within this framework. Perhaps the most interesting of these would be formulating general necessary and sufficient conditions for learnability at an exponential rate, and determining for what types of algorithms Theorem 3.5 can be extended to the agnostic case or to infinite capacity hypothesis classes. We will discuss some progress on this latter problem in the next chapter.


3.7  The  Verifiable  Label  Complexity  of  the  Empty  Interval 


Let $h_-$ denote the all-negative interval. In this section, we lower bound the verifiable label complexities achievable for this classifier, with respect to the hypothesis class $C$ of interval classifiers under a uniform distribution on $[0, 1]$. Specifically, suppose there exists an algorithm $A$ that achieves a verifiable label complexity $\Lambda(\epsilon, \delta, h)$ such that for some $\epsilon \in (0, 1/4)$ and some $\delta \in (0, 1/4)$,

\[ \Lambda(\epsilon, \delta, h_-) < \frac{1}{24\epsilon}. \]

We prove that this would imply the existence of some interval $h'$ for which the value of $\Lambda(\epsilon, \delta, h')$ is not valid under Definition 3.2. We proceed by the probabilistic method.

Consider the subset of intervals

\[ H_\epsilon = \left\{ [3i\epsilon, 3(i+1)\epsilon] : i \in \left\{ 0, 1, \ldots, \left\lfloor \frac{1 - 3\epsilon}{3\epsilon} \right\rfloor \right\} \right\}. \]

Let $s = \lceil \Lambda(\epsilon, \delta, h_-) \rceil$. For any $f \in C$, let $R_f$, $\hat{h}_f$, and $\hat{\epsilon}_f$ denote the random variables representing, respectively, the set of examples $(x, y)$ for which $A(s, \delta)$ requests labels (including their $y = f(x)$ labels), the classifier $A(s, \delta)$ outputs, and the confidence bound $A(s, \delta)$ outputs, when $f$ is the target function. Let $\mathbb{1}$ be an indicator function that is 1 if its argument is true and 0
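otherwise. Then

\begin{align*}
& \max_{f \in H_\epsilon} \mathbb{P}\left( \mathbb{P}_X(\hat{h}_f(X) \neq f(X)) > \hat{\epsilon}_f \right) \\
& \geq \frac{1}{|H_\epsilon|} \sum_{f \in H_\epsilon} \mathbb{P}\left( \mathbb{P}_X(\hat{h}_f(X) \neq f(X)) > \hat{\epsilon}_f \right) \\
& \geq \frac{1}{|H_\epsilon|} \sum_{f \in H_\epsilon} \mathbb{P}\left( (R_f = R_{h_-}) \wedge \left( \mathbb{P}_X(\hat{h}_f(X) \neq f(X)) > \hat{\epsilon}_f \right) \right) \\
& = \mathbb{E}\left[ \frac{1}{|H_\epsilon|} \sum_{f \in H_\epsilon : R_f = R_{h_-}} \mathbb{1}\left[ \mathbb{P}_X(\hat{h}_f(X) \neq f(X)) > \hat{\epsilon}_f \right] \right] \\
& \geq \mathbb{E}\left[ \frac{1}{|H_\epsilon|} \sum_{f \in H_\epsilon : R_f = R_{h_-}} \mathbb{1}\left[ \left( \mathbb{P}_X(\hat{h}_f(X) = +1) \leq \epsilon \right) \wedge \left( \hat{\epsilon}_f \leq \epsilon \right) \right] \right] \tag{3.2} \\
& = \mathbb{E}\left[ \frac{1}{|H_\epsilon|} \sum_{f \in H_\epsilon : R_f = R_{h_-}} \mathbb{1}\left[ \left( \mathbb{P}_X(\hat{h}_{h_-}(X) \neq h_-(X)) \leq \epsilon \right) \wedge \left( \hat{\epsilon}_{h_-} \leq \epsilon \right) \right] \right] \tag{3.3} \\
& \geq \mathbb{E}\left[ \frac{|H_\epsilon| - s}{|H_\epsilon|} \mathbb{1}\left[ \mathbb{P}_X(\hat{h}_{h_-}(X) \neq h_-(X)) \leq \hat{\epsilon}_{h_-} \leq \epsilon \right] \right] \tag{3.4} \\
& = \frac{|H_\epsilon| - s}{|H_\epsilon|} \mathbb{P}\left( \mathbb{P}_X(\hat{h}_{h_-}(X) \neq h_-(X)) \leq \hat{\epsilon}_{h_-} \leq \epsilon \right) \\
& \geq \frac{|H_\epsilon| - s}{|H_\epsilon|} (1 - \delta) > \delta.
\end{align*}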

All expectations are over the draw of the unlabeled examples and any additional random bits used by the algorithm. Line (3.2) follows from the fact that all intervals $f \in H_\epsilon$ are of width $3\epsilon$, so if $\hat{h}_f$ labels less than a fraction $\epsilon$ of the points as positive, it must make an error of at least $2\epsilon$ with respect to $f$, which is more than $\hat{\epsilon}_f$ if $\hat{\epsilon}_f \leq \epsilon$. Note that, for any fixed sequence of unlabeled examples and additional random bits used by the algorithm, the sets $R_f$ are completely determined, and any $f$ and $f'$ for which $R_f = R_{f'}$ must have $\hat{h}_f = \hat{h}_{f'}$ and $\hat{\epsilon}_f = \hat{\epsilon}_{f'}$. In particular, any $f$ for which $R_f = R_{h_-}$ will yield identical outputs from the algorithm, which implies line (3.3). Furthermore, the only classifiers $f \in H_\epsilon$ for which $R_f \neq R_{h_-}$ are those for which some $(x, -1) \in R_{h_-}$ has $f(x) = +1$ (i.e., $x$ is in the $f$ interval). But since there is zero probability that any unlabeled example is in more than one of the intervals in $H_\epsilon$, with probability 1 there are at most $s$ intervals $f \in H_\epsilon$ with $R_f \neq R_{h_-}$, which explains line (3.4).

This proves the existence of some target function $h^* \in C$ such that $\mathbb{P}(\mathrm{er}(\hat{h}_{s,\delta}) > \hat{\epsilon}_{s,\delta}) > \delta$, which contradicts the conditions of Definition 3.2.


3.8 Proof of Theorem 3.7

First note that the total number of label requests used by the aggregation procedure in Algorithm 4 is at most $t$. Initially running the algorithms $A_1, \ldots, A_k$ requires $\sum_{i=1}^{k} \lfloor t/(4i^2) \rfloor \leq t/2$ labels, and the second phase of the algorithm requires $k^2 \lceil 72 \ln(4k/\delta) \rceil$ labels, which by definition of $k$ is also less than $t/2$. Thus this procedure is a valid learning algorithm.
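The budget accounting above can be checked numerically. The sketch below assumes that $k$ is chosen as the largest integer with $k^2 \lceil 72 \ln(4k/\delta) \rceil \leq t/2$; that definition lives in the aggregation procedure itself and is an assumption here, as are the function name `aggregation_budget` and the example values of $t$ and $\delta$.

```python
import math

def aggregation_budget(t, delta):
    """Return (k, labels used in phase one, labels used in phase two)."""
    # Assumption: k is the largest integer with k^2 * ceil(72 ln(4k/delta)) <= t/2.
    k = 0
    while (k + 1) ** 2 * math.ceil(72 * math.log(4 * (k + 1) / delta)) <= t / 2:
        k += 1
    # Phase one: algorithm A_i is run with budget floor(t / (4 i^2)).
    phase_one = sum(t // (4 * i * i) for i in range(1, k + 1))
    # Phase two: k^2 pairwise comparisons, each using ceil(72 ln(4k/delta)) labels.
    phase_two = k * k * math.ceil(72 * math.log(4 * k / delta)) if k > 0 else 0
    return k, phase_one, phase_two

k, phase_one, phase_two = aggregation_budget(t=100_000, delta=0.05)
```

Phase one stays below $t/2$ for any $k$ because $\sum_{i \geq 1} 1/(4i^2) = \pi^2/24 < 1/2$, while phase two stays below $t/2$ by the (assumed) choice of $k$.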

Now suppose that the true target $h^*$ is a member of $C_i$. We must show that for any input $t$ such that

\[ t \geq \max\left\{ 4i^2 \lceil \Lambda_i(\epsilon/2, \delta/2, h^*) \rceil, \; 2i^2 \lceil 72 \ln(4i/\delta) \rceil \right\}, \]

the aggregation procedure outputs a hypothesis $\hat{h}_t$ such that $\mathrm{er}(\hat{h}_t) \leq \epsilon$ with probability at least $1 - \delta$.

First notice that since $t \geq 2i^2 \lceil 72 \ln(4i/\delta) \rceil$, $k \geq i$. Furthermore, since $t/(4i^2) \geq \lceil \Lambda_i(\epsilon/2, \delta/2, h^*) \rceil$, with probability at least $1 - \delta/2$, running $A_i(\lfloor t/(4i^2) \rfloor, \delta/2)$ returns a function $h_i$ with $\mathrm{er}(h_i) \leq \epsilon/2$.

Let $j^* = \mathrm{argmin}_j \, \mathrm{er}(h_j)$. Since $\mathrm{er}(h_{j^*}) \leq \mathrm{er}(h_\ell)$ for any $\ell$, we would expect $h_{j^*}$ to make no more errors than $h_\ell$ on points where the two functions disagree. It then follows from Hoeffding's inequality that, with probability at least $1 - \delta/4$, for all $\ell$,

\[ m_{j^* \ell} \leq \frac{7}{12} \lceil 72 \ln(4k/\delta) \rceil, \]

and thus

\[ \min_j \max_\ell m_{j\ell} \leq \frac{7}{12} \lceil 72 \ln(4k/\delta) \rceil. \]

Similarly, by Hoeffding's inequality and a union bound, with probability at least $1 - \delta/4$, for any $\ell$ such that

\[ m_{\ell j^*} \leq \frac{7}{12} \lceil 72 \ln(4k/\delta) \rceil, \]

the probability that $h_\ell$ mislabels a point $x$ given that $h_\ell(x) \neq h_{j^*}(x)$ is less than $2/3$, and thus $\mathrm{er}(h_\ell) \leq 2\,\mathrm{er}(h_{j^*})$. By a union bound over these three events, we find that, as desired, with probability at least $1 - \delta$,

\[ \mathrm{er}(\hat{h}_t) \leq 2\,\mathrm{er}(h_{j^*}) \leq 2\,\mathrm{er}(h_i) \leq \epsilon. \]


3.9 Proof of Theorem 3.8

Assume that $(C, \mathcal{D})$ is learnable at an exponential rate. This means that there exists an algorithm $A$ such that for any target $h^*$ in $C$, there exist constants $\gamma_{h^*}$ and $k_{h^*}$ such that for any $\epsilon$ and $\delta$, for any $t \geq \gamma_{h^*} (\log(1/(\epsilon\delta)))^{k_{h^*}}$, with probability at least $1 - \delta$, after $t$ label requests, $A(t, \delta)$ outputs an $\epsilon$-good classifier.

For each $i$, let

\[ C_i = \{h \in C : \gamma_h \leq i, \; k_h \leq i\}. \]

Define an algorithm $A_i$ that achieves the required polylog verifiable label complexity on $(C_i, \mathcal{D})$ as follows. First, run the algorithm $A$ to obtain a function $h_A$. Then, output the classifier in $C_i$ that is closest to $h_A$, i.e., the classifier that minimizes the probability of disagreement with $h_A$. If $t \geq i (\log(2/(\epsilon\delta)))^i$, then after $t$ label requests, with probability at least $1 - \delta$, $A(t, \delta)$ outputs an $\epsilon/2$-good classifier, so by the triangle inequality, with probability at least $1 - \delta$, $A_i(t, \delta)$ outputs an $\epsilon$-good classifier.

It can be guaranteed that with probability at least $1 - \delta$, the function output by $A_i$ has error no more than $\hat{\epsilon}_t = (2/\delta) \exp\{-(t/i)^{1/i}\}$, which is no more than $\epsilon$, implying that the expression above is a verifiable label complexity.
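To see why the confidence bound $\hat{\epsilon}_t$ is at most $\epsilon$ once $t \geq i(\log(2/(\epsilon\delta)))^i$, note that $(t/i)^{1/i} \geq \log(2/(\epsilon\delta))$, so $\exp\{-(t/i)^{1/i}\} \leq \epsilon\delta/2$. The following sketch checks this numerically on a handful of arbitrary parameter values; the function names, the grid, and the small floating-point tolerance are assumptions made for illustration.

```python
import math

def label_budget(eps, delta, i):
    """The threshold i * (log(2 / (eps * delta)))^i from the text."""
    return i * math.log(2.0 / (eps * delta)) ** i

def confidence_bound(t, delta, i):
    """The verifiable error bound (2 / delta) * exp(-(t / i)^(1 / i))."""
    return (2.0 / delta) * math.exp(-((t / i) ** (1.0 / i)))

checks = []
for i in (1, 2, 3):
    for eps in (0.1, 0.01, 0.001):
        for delta in (0.2, 0.05):
            t = math.ceil(label_budget(eps, delta, i))
            # Small relative tolerance to absorb floating-point rounding.
            checks.append(confidence_bound(t, delta, i) <= eps * (1 + 1e-9))
```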

Combining this with Theorem 3.7 yields the desired result.


3.10 Heuristic Approaches to Decomposition


As mentioned, decomposing purely based on verifiable complexity with respect to $(C, \mathcal{D})$ typically cannot yield a good decomposition even for very simple problems, such as unions of intervals. The reason is that the set of classifiers with high verifiable label complexity may itself have high verifiable complexity.

Although we have not yet found a general method that can provably always find a good decomposition when one exists (other than the trivial method in the proof of Theorem 3.8), we find that a heuristic recursive technique is frequently effective. To begin, define $C_1 = C$. Then for $i > 1$, recursively define $C_i$ as the set of all $h \in C_{i-1}$ such that $\theta_h = \infty$ with respect to $(C_{i-1}, \mathcal{D})$. (Here $\theta_h$ is the disagreement coefficient of $h$.) Suppose that for some $N$, $C_{N+1} = \emptyset$. Then for the decomposition $C_1, C_2, \ldots, C_N$, every $h \in C$ has $\theta_h < \infty$ with respect to at least one of the sets in which it is contained, which implies that the verifiable label complexity of $h$ with respect to that set is $O(\mathrm{polylog}(1/(\epsilon\delta)))$, and the aggregation algorithm can be used to achieve polylog label complexity.
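The recursion just described can be sketched as a simple peeling loop. This is only a schematic: the predicate `has_infinite_coefficient` is a hypothetical oracle standing in for the test "$\theta_h = \infty$ with respect to the current class", which is not computable in general, and the finite toy class used in the example has no learning-theoretic meaning.

```python
def heuristic_decomposition(hypotheses, has_infinite_coefficient, max_rounds=100):
    """Peel off nested classes C_1, C_2, ... until the next class is empty.

    has_infinite_coefficient(h, current_class) is a hypothetical oracle that
    returns True when theta_h is infinite with respect to current_class.
    Returns the list [C_1, ..., C_N], or None if no empty class is reached
    within max_rounds rounds.
    """
    decomposition = []
    current = frozenset(hypotheses)  # C_1 = C
    for _ in range(max_rounds):
        if not current:              # C_{N+1} is empty: decomposition complete
            return decomposition
        decomposition.append(current)
        # C_i = { h in C_{i-1} : theta_h = infinity w.r.t. C_{i-1} }
        current = frozenset(h for h in current
                            if has_infinite_coefficient(h, current))
    return None

# Toy stand-in oracle (an assumption): h is "hard" unless it is the largest element.
toy = heuristic_decomposition({1, 2, 3}, lambda h, cls: h < max(cls))
```

In the toy run each round removes exactly one element, so the loop stops after three nested classes; with a real oracle the recursion may fail to terminate, which is why the sketch caps the number of rounds.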


We could alternatively perform a similar decomposition using a suitable definition of splitting index [Dasgupta, 2005], or more generally using

\[ \limsup_{\epsilon \to 0} \frac{\Lambda_{C_{i-1}}(\epsilon, \delta, h)}{(\log(1/(\epsilon\delta)))^k} \]

for some fixed constant $k > 0$.

This procedure does not always generate a good decomposition. However, if $N < \infty$ exists, then it creates a decomposition for which the aggregation algorithm, combined with an appropriate sequence of algorithms $\{A_i\}$, could achieve exponential rates. In particular, this is the case
for  all  of  the  (C,  V)  described  in  Section  1331  In  fact,  even  if  N  =  oo,  as  long  as  every  h  E  C 
does end up in some set $C_i$ for finite $i$, this decomposition would still provide exponential rates.


3.11 Proof of Theorem 3.5


We now finally prove Theorem 3.5. This section is mostly self-contained, though we do make
use  of  Theorem  ITT]  from  Section  lT4l  in  the  final  step  of  the  proof. 

The proof proceeds according to the following outline. We begin in Lemma 3.12 by describing special conditions under which a CAL-like algorithm has the property that the more unlabeled examples it considers, the smaller the fraction of them it asks to be labeled. Since CAL is able to identify the target's true label on any example it considers (either the label of the example is requested or the example is not in the region of disagreement and therefore the label is already known), we end up with a set of labeled examples growing strictly faster than the number of label requests used to obtain it. This set of labeled examples can be used as a training set in any passive learning algorithm. However, the special conditions under which this happens are rather limiting. In Lemma 3.13 we exploit a subtle relation between overlapping boundary regions and shatterable sets to show that we can decompose any finite VC dimension class into a countable number of subsets satisfying these special conditions. This, combined with the aggregation algorithm, and a simple procedure that boosts the confidence level, extends Lemma 3.12 to the general conditions of Theorem 3.5.

Before jumping into Lemma 3.12, it is useful to define some additional notation. For any $V \subseteq C$ and $h \in C$, define the boundary of $h$ with respect to $\mathcal{D}$ and $V$, denoted $\partial_V h$, as

\[ \partial_V h = \lim_{r \to 0} \mathrm{DIS}(B_V(h, r)). \]

Lemma 3.12. Suppose $(C, \mathcal{D})$ is such that $C$ has finite VC dimension $d$, and $\forall h \in C$, $\mathbb{P}(\partial_C h) = 0$. Then for any passive learning label complexity $\Lambda_p(\epsilon, \delta, h)$ for $(C, \mathcal{D})$ which is nondecreasing as $\epsilon \to 0$, there exists an active learning algorithm achieving a label complexity $\Lambda_a(\epsilon, \delta, h)$ such that, for any $\delta > 0$ and any target function $h^* \in C$ with $\Lambda_p(\epsilon, \delta, h^*) = \omega(1)$ and $\forall \epsilon > 0$, $\Lambda_p(\epsilon, \delta, h^*) < \infty$,

\[ \Lambda_a(\epsilon, 2\delta, h^*) = o(\Lambda_p(\epsilon, \delta, h^*)). \]

Proof. Recall that $t$ is the "budget" of the active learning algorithm, and our goal in this proof is to define an active learning algorithm $A_a$ and a function $\Lambda_a(\epsilon, \delta, h^*)$ such that, if $t \geq \Lambda_a(\epsilon, \delta, h^*)$ and $h^* \in C$ is the target function, then $A_a(t, \delta)$ will, with probability $1 - \delta$, output an $\epsilon$-good classifier; furthermore, we require that $\Lambda_a(\epsilon, 2\delta, h^*) = o(\Lambda_p(\epsilon, \delta, h^*))$ under the conditions on $h^*$ in the lemma statement.

To  construct  this  algorithm,  we  perform  the  learning  in  two  phases.  The  first  is  a  passive 
phase,  where  we  focus  on  reducing  a  version  space,  to  shrink  the  region  of  disagreement;  the 
second  is  a  phase  where  we  construct  a  labeled  training  set,  which  is  much  larger  than  the  number 
of  label  requests  used  to  construct  it  since  all  classifiers  in  the  version  space  agree  on  many  of 
the  examples’  labels. 

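To begin the first phase, we simply request the labels of $x_1, x_2, \ldots, x_{\lfloor t/2 \rfloor}$, and let

\[ V = \{h \in C : \forall i \leq \lfloor t/2 \rfloor, \; h(x_i) = h^*(x_i)\}. \]

In other words, $V$ is the set of all hypotheses in $C$ that correctly label the first $\lfloor t/2 \rfloor$ examples. By standard consistency results [Blumer et al., 1989, Devroye et al., 1996, Vapnik, 1982], there is a universal constant $c > 0$ such that, with probability at least $1 - \delta/2$,

\[ \sup_{h \in V} \mathrm{er}(h) \leq c \left( \frac{d \ln t + \ln \frac{1}{\delta}}{t} \right). \]

This implies that

\[ V \subseteq B\left( h^*, \; c \left( \frac{d \ln t + \ln \frac{1}{\delta}}{t} \right) \right), \]

and thus $\mathbb{P}(\mathrm{DIS}(V)) \leq \Delta_t$ where

\[ \Delta_t = \mathbb{P}\left( \mathrm{DIS}\left( B\left( h^*, \; c \left( \frac{d \ln t + \ln \frac{1}{\delta}}{t} \right) \right) \right) \right). \]

Clearly, $\Delta_t$ goes to 0 as $t$ grows, by the assumption on $\mathbb{P}(\partial_C h^*)$.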

Next, in the second phase of the algorithm, we will actively construct a set of labeled examples to use with the passive learning algorithm. If ever we have $\mathbb{P}(\mathrm{DIS}(V)) = 0$ for some finite $t$, then clearly we can return any $h \in V$, so this case is easy.

Otherwise, let $n_t = \lfloor t/(24\,\mathbb{P}(\mathrm{DIS}(V)) \ln(4/\delta)) \rfloor$, and suppose $t \geq 2$. By a Chernoff bound, with probability at least $1 - \delta/2$, in the sequence of examples $x_{\lfloor t/2 \rfloor + 1}, x_{\lfloor t/2 \rfloor + 2}, \ldots, x_{\lfloor t/2 \rfloor + n_t}$, at most $t/2$ of the examples are in $\mathrm{DIS}(V)$. If this is not the case, we fail and output an arbitrary $h$; otherwise, we request the labels of every one of these $n_t$ examples that are in $\mathrm{DIS}(V)$.

Now construct a sequence $\mathcal{L} = \{(x'_1, y'_1), (x'_2, y'_2), \ldots, (x'_{n_t}, y'_{n_t})\}$ of labeled examples such that $x'_i = x_{\lfloor t/2 \rfloor + i}$, and $y'_i$ is either the label agreed upon by all the elements of $V$, or it is the $h^*(x_{\lfloor t/2 \rfloor + i})$ label value we explicitly requested. Note that because $\inf_{h \in V} \mathrm{er}(h) = 0$ with probability 1, we also have that with probability 1 every $y'_i = h^*(x'_i)$. We may therefore use these $n_t$ examples as iid training examples for the passive learning algorithm.
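The two phases described so far can be sketched for one concrete case. The code below is a minimal illustration, assuming threshold classifiers $h_z(x) = +1$ iff $x \geq z$ on $[0, 1]$ under the uniform distribution (a class of VC dimension 1, chosen here only because its version space and region of disagreement are intervals); the function name and parameters are hypothetical, and the final step of feeding the labeled sample to a passive learner is omitted.

```python
import math
import random

def two_phase_labeled_sample(t, delta, target, rng):
    """Build the labeled sample for thresholds h_z(x) = +1 iff x >= z (assumed class).

    Returns (labeled_examples, number_of_phase_two_requests), or None if the
    second phase would need more than t/2 label requests (the "fail" case).
    """
    def label(x):                      # label oracle for the target threshold
        return 1 if x >= target else -1

    half = t // 2
    lo, hi = 0.0, 1.0                  # version space V: thresholds z with lo < z <= hi
    for _ in range(half):              # phase 1: request every label
        x = rng.random()
        if label(x) == 1:
            hi = min(hi, x)
        else:
            lo = max(lo, x)

    p_dis = hi - lo                    # P(DIS(V)) under the uniform distribution
    if p_dis <= 0:
        return [], 0                   # any h in V can be returned directly
    n_t = math.floor(t / (24 * p_dis * math.log(4 / delta)))

    labeled, requests = [], 0
    for _ in range(n_t):               # phase 2: request labels only inside DIS(V)
        x = rng.random()
        if lo < x < hi:
            if requests >= t / 2:
                return None            # more than t/2 examples fell in DIS(V)
            requests += 1
            y = label(x)
        else:
            y = 1 if x >= hi else -1   # label agreed upon by every h in V
        labeled.append((x, y))
    return labeled, requests
```

Every inferred label coincides with the target's label because the target threshold always stays inside the version space, so the returned sample can be treated as ordinary labeled training data even though only a small fraction of it was paid for.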

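Suppose $A$ is the passive learning algorithm that guarantees $\Lambda_p(\epsilon, \delta, h)$ passive label complexities. Then let $h_t$ be the classifier returned by $A(\mathcal{L}, \delta)$. This is the classifier the active learning algorithm outputs.

Note that if $n_t \geq \Lambda_p(\epsilon, \delta, h^*)$, then with probability at least $1 - \delta$ over the draw of $\mathcal{L}$, $\mathrm{er}(h_t) \leq \epsilon$. Define

\[ \Lambda_a(\epsilon, 2\delta, h^*) = 1 + \inf\left\{ s : s \geq 144 \ln(4/\delta) \Lambda_p(\epsilon, \delta, h^*) \Delta_s \right\}. \]

This is well-defined when $\Lambda_p(\epsilon, \delta, h^*) < \infty$ because $\Delta_s$ is nonincreasing in $s$, so some value of $s$ will satisfy the inequality. Note that if $t \geq \Lambda_a(\epsilon, 2\delta, h^*)$, then (with probability at least $1 - \delta/2$)

\[ \Lambda_p(\epsilon, \delta, h^*) \leq \frac{t}{144 \ln(4/\delta) \Delta_t} \leq n_t. \]

So, by a union bound over the possible failure events listed above ($\delta/2$ for $\mathbb{P}(\mathrm{DIS}(V)) > \Delta_t$, $\delta/2$ for more than $t/2$ examples of $\mathcal{L}$ in $\mathrm{DIS}(V)$, and $\delta$ for $\mathrm{er}(h_t) > \epsilon$ when the previous failures do not occur), if $t \geq \Lambda_a(\epsilon, 2\delta, h^*)$, then with probability at least $1 - 2\delta$, $\mathrm{er}(h_t) \leq \epsilon$. So $\Lambda_a(\epsilon, \delta, h^*)$ is a valid label complexity function, achieved by the described algorithm. Furthermore,

\[ \Lambda_a(\epsilon, 2\delta, h^*) \leq 1 + 144 \ln(4/\delta) \Lambda_p(\epsilon, \delta, h^*) \Delta_{\Lambda_a(\epsilon, 2\delta, h^*) - 2}. \]

If $\Lambda_a(\epsilon, 2\delta, h^*) = O(1)$, then since $\Lambda_p(\epsilon, \delta, h^*) = \omega(1)$, the result is established. Otherwise, since $\Lambda_a(\epsilon, \delta, h^*)$ is nondecreasing as $\epsilon \to 0$, $\Lambda_a(\epsilon, 2\delta, h^*) = \omega(1)$, so we know that $\Delta_{\Lambda_a(\epsilon, 2\delta, h^*) - 2} = o(1)$. Thus, $\Lambda_a(\epsilon, 2\delta, h^*) = o(\Lambda_p(\epsilon, \delta, h^*))$.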


□ 


As an interesting aside, it is also true (by essentially the same argument) that under the conditions of Lemma 3.12, the verifiable label complexity of active learning is strictly smaller than the verifiable label complexity of passive learning in this same sense. In particular, this implies a verifiable label complexity that is $o(1/\epsilon)$ under these conditions. For instance, with some effort one can show that these conditions are satisfied when the VC dimension of $C$ is 1, or when the support of $\mathcal{D}$ is at most countably infinite. However, for more complex learning problems, this condition will typically not be satisfied, and as such we require some additional work in order to use this lemma toward a proof of the general result in Theorem 3.5. Toward this end, we again turn to the idea of a decomposition of $C$, this time decomposing it into subsets satisfying the condition in Lemma 3.12.

Lemma 3.13. For any $(C, \mathcal{D})$ where $C$ has finite VC dimension $d$, there exists a countably infinite sequence $C_1, C_2, \ldots$ such that $C = \bigcup_{i=1}^{\infty} C_i$ and $\forall i$, $\forall h \in C_i$, $\mathbb{P}(\partial_{C_i} h) = 0$.

Proof. The case of $d = 0$ is clear, so assume $d > 0$. A decomposition procedure is given below. We will show that, if we let $\mathbb{H} = \mathrm{Decompose}(C)$, then the maximum recursion depth is at most $d$ (counting the initial call as depth 0). Note that if this is true, then the lemma is proved, since it implies that $\mathbb{H}$ can be uniquely indexed by a $d$-tuple of integers, of which there are at most countably many.

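Algorithm 2 Decompose($\mathcal{H}$)

Let $\mathcal{H}_\infty = \{h \in \mathcal{H} : \mathbb{P}(\partial_{\mathcal{H}} h) = 0\}$
if $\mathcal{H}_\infty = \mathcal{H}$ then
    Return $\{\mathcal{H}\}$
else
    For $i \in \{1, 2, \ldots\}$, let $\mathcal{H}_i = \{h \in \mathcal{H} : \mathbb{P}(\partial_{\mathcal{H}} h) \in ((1 + 2^{-(d+3)})^{-i}, (1 + 2^{-(d+3)})^{1-i}]\}$
    Return $\bigcup_{i \in \{1, 2, \ldots\}} \mathrm{Decompose}(\mathcal{H}_i) \cup \{\mathcal{H}_\infty\}$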
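end if

For the sake of contradiction, suppose that the maximum recursion depth of $\mathrm{Decompose}(C)$ is more than $d$ (or is infinite). Thus, based on the first $d + 1$ recursive calls in one of those deepest paths in the recursion tree, there is a sequence of sets

\[ C = \mathcal{H}^{(0)} \supseteq \mathcal{H}^{(1)} \supseteq \mathcal{H}^{(2)} \supseteq \cdots \supseteq \mathcal{H}^{(d+1)} \neq \emptyset \]

and a corresponding sequence of finite positive integers $i_1, i_2, \ldots, i_{d+1}$ such that for each $j \in \{1, 2, \ldots, d + 1\}$, every $h \in \mathcal{H}^{(j)}$ has

\[ \mathbb{P}(\partial_{\mathcal{H}^{(j-1)}} h) \in \left( (1 + 2^{-(d+3)})^{-i_j}, \; (1 + 2^{-(d+3)})^{1 - i_j} \right]. \]

Take any $h_{d+1} \in \mathcal{H}^{(d+1)}$. There must exist some $r > 0$ such that $\forall j \in \{1, 2, \ldots, d + 1\}$,

\[ \mathbb{P}(\mathrm{DIS}(B_{\mathcal{H}^{(j-1)}}(h_{d+1}, r))) \in \left( (1 + 2^{-(d+3)})^{-i_j}, \; (1 + 2^{-(d+2)})(1 + 2^{-(d+3)})^{-i_j} \right]. \tag{3.5} \]

In particular, by (3.5), each $h \in B_{\mathcal{H}^{(j)}}(h_{d+1}, r/2)$ has

\[ \mathbb{P}(\partial_{\mathcal{H}^{(j-1)}} h) > (1 + 2^{-(d+3)})^{-i_j} \geq (1 + 2^{-(d+2)})^{-1} \, \mathbb{P}(\mathrm{DIS}(B_{\mathcal{H}^{(j-1)}}(h_{d+1}, r))), \]

though by definition of $h$ and the triangle inequality,

\[ \mathbb{P}(\partial_{\mathcal{H}^{(j-1)}} h \setminus \mathrm{DIS}(B_{\mathcal{H}^{(j-1)}}(h_{d+1}, r))) = 0. \]

Recall that in general, for sets $Q$ and $R_1, R_2, \ldots, R_k$, if $\mathbb{P}(R_i \setminus Q) = 0$ for all $i$, then $\mathbb{P}(\bigcap_i R_i) \geq \mathbb{P}(Q) - \sum_{i=1}^{k} (\mathbb{P}(Q) - \mathbb{P}(R_i))$. Thus, for any $j$, any set of $\leq 2^{d+1}$ classifiers $T \subset B_{\mathcal{H}^{(j)}}(h_{d+1}, r/2)$ must have

\[ \mathbb{P}\left( \bigcap_{h \in T} \partial_{\mathcal{H}^{(j-1)}} h \right) \geq \left( 1 - 2^{d+1} \left( 1 - (1 + 2^{-(d+2)})^{-1} \right) \right) \mathbb{P}(\mathrm{DIS}(B_{\mathcal{H}^{(j-1)}}(h_{d+1}, r))) > 0. \]

That is, any set of $2^{d+1}$ classifiers in $\mathcal{H}^{(j)}$ within distance $r/2$ of $h_{d+1}$ will have boundaries with respect to $\mathcal{H}^{(j-1)}$ which have a nonzero probability overlap. The remainder of the proof will hinge on this fact that these boundaries overlap.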
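The positivity of the overlap rests on the constant $1 - 2^{d+1}(1 - (1 + 2^{-(d+2)})^{-1})$ being strictly positive. Writing $a = 2^{-(d+2)}$, this constant simplifies to $(1 + 2a)/(2(1 + a))$, which always exceeds $1/2$. The sketch below checks this with exact rational arithmetic; the function name `overlap_fraction` and the range of $d$ tested are illustrative assumptions.

```python
from fractions import Fraction

def overlap_fraction(d):
    """Exact value of 1 - 2^(d+1) * (1 - 1 / (1 + 2^-(d+2)))."""
    a = Fraction(1, 2 ** (d + 2))
    return 1 - 2 ** (d + 1) * (1 - 1 / (1 + a))

def closed_form(d):
    """The simplified form (1 + 2a) / (2 (1 + a)) with a = 2^-(d+2)."""
    a = Fraction(1, 2 ** (d + 2))
    return (1 + 2 * a) / (2 * (1 + a))

fractions_by_d = {d: overlap_fraction(d) for d in range(1, 30)}
```

This is where the particular constants $2^{-(d+2)}$ and $2^{-(d+3)}$ in the decomposition are used: they keep the total mass lost by intersecting up to $2^{d+1}$ boundaries below half of the disagreement region.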

We now construct a shattered set of points of size $d + 1$. Consider constructing a binary tree with $2^{d+1}$ leaves as follows. The root node contains $h_{d+1}$ (call this level $d + 1$). Let $h_d \in B_{\mathcal{H}^{(d)}}(h_{d+1}, r/4)$ be some classifier with $\mathbb{P}(h_d(X) \neq h_{d+1}(X)) > 0$. Let the left child of the root be $h_{d+1}$ and the right child be $h_d$ (call this level $d$). Define $A_d = \{x : h_d(x) \neq h_{d+1}(x)\}$, and let $\Delta_d = 2^{-(d+2)} \mathbb{P}(A_d)$. Now for each $\ell \in \{d - 1, d - 2, \ldots, 0\}$ in decreasing order, we define the $\ell$ level of the tree as follows. Let $T_{\ell+1}$ denote the nodes at the $\ell + 1$ level in the tree, and let $A'_\ell = \bigcap_{h \in T_{\ell+1}} \partial_{\mathcal{H}^{(\ell)}} h$. We iterate over the elements of $T_{\ell+1}$ in left-to-right order, and for each one $h$, we find $h' \in B_{\mathcal{H}^{(\ell)}}(h, \Delta_{\ell+1})$ with

\[ \mathbb{P}_{\mathcal{D}}(h(x) \neq h'(x) \wedge x \in A'_\ell) > 0. \]

We then define the left child of $h$ to be $h$ and the right child to be $h'$, and we update

\[ A'_\ell \leftarrow A'_\ell \cap \{x : h(x) \neq h'(x)\}. \]

After iterating through all the elements of $T_{\ell+1}$ in this manner, define $A_\ell$ to be the final value of $A'_\ell$ and $\Delta_\ell = 2^{-(d+2)} \mathbb{P}(A_\ell)$. The key is that, because every $h$ in the tree is within $r/2$ of $h_{d+1}$, the set $A'_\ell$ always has nonzero measure, and is contained in $\partial_{\mathcal{H}^{(\ell)}} h$ for any $h \in T_{\ell+1}$, so there always exists an $h'$ arbitrarily close to $h$ with $\mathbb{P}_{\mathcal{D}}(h(x) \neq h'(x) \wedge x \in A'_\ell) > 0$.

Note  that  for  £  G  {0, 1,  2, ,  d},  every  node  in  the  left  subtree  of  any  h  at  level  £  +  1  is 
strictly  within  distance  2  At  of  h,  and  every  node  in  the  right  subtree  of  any  h  at  level  £  +  1  is 
strictly  within  distance  2  A^  of  the  right  child  of  h.  Thus, 

F(3ti  G  Tt,  h"  G  Subtree(ti)  :  h\x)  ^  h"(x))  <  2d+12Ag. 

Since 

2d+12Ai  =  F(Ai)  =F(x  E  P|  Qy{t,b!  and  V  siblings  hi,  h2  G  Tg,  hi(x)  7^  h2(x)), 

h'£Te+1 

there  must  be  some  set 

A*e  =  {x  G  P)  dyW h!  s.t.  Vsiblings  hi,  h2  G  Tg,  h1(x)^h2(x) 

h'GTp. |_i 

and  V7i  G  Tg,  h!  G  Subtree(h),h(x)  =  h'(x)}  C  A g 
with $\mathbb{P}(A^*_\ell) > 0$. That is, for every $h$ at level $\ell + 1$, every node in its left subtree agrees with $h$ on
every $x \in A^*_\ell$ and every node in its right subtree disagrees with $h$ on every $x \in A^*_\ell$. Therefore,
taking any $\{x_0, x_1, x_2, \ldots, x_d\}$ such that each $x_\ell \in A^*_\ell$ creates a shatterable set (shattered by the
set of leaf nodes in the tree). This contradicts VC dimension $d$, so we must have the desired
claim that the maximum recursion depth is at most $d$. □


Before completing the proof of Theorem 3.5, we have two additional minor concerns to
address. The first is that the confidence level in Lemma 3.12 is slightly smaller than needed for
the theorem. The second is that Lemma 3.12 only applies when $\Lambda_p(\epsilon, \delta, h^*) < \infty$ for all $\epsilon > 0$.
We can address both of these concerns with the following lemma.


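Lemma 3.14. Suppose $(\mathbb{C}, \mathcal{D})$ is such that $\mathbb{C}$ has finite VC dimension $d$, and suppose
$\Lambda'_a(\epsilon, \delta, h^*)$ is a label complexity for $(\mathbb{C}, \mathcal{D})$. Then there is a label complexity $\Lambda_a(\epsilon, \delta, h^*)$ for
$(\mathbb{C}, \mathcal{D})$ s.t. for any $\delta \in (0, 1/4)$ and $\epsilon \in (0, 1/2)$,

$$\Lambda_a(\epsilon, \delta, h^*) \leq (k + 2) \max\left\{ \min\left\{ \Lambda'_a(\epsilon/2, 4\delta, h^*),\ \ldots \right\},\ (k + 1)^2\, 72 \log\left(4(k + 1)^2/\delta\right) \right\},$$

where $k = \lceil \log(\delta/2)/\log(4\delta) \rceil$.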


Proof. Suppose $\mathcal{A}'_a$ is the algorithm achieving $\Lambda'_a(\epsilon, \delta, h^*)$. Then we can define a new algorithm
$\mathcal{A}_a$ as follows. Suppose $t$ is the budget of label requests allowed of $\mathcal{A}_a$ and $\delta$ is its confidence
argument. We partition the indices of the unlabeled sequence into $k + 2$ infinite subsequences.
For $i \in \{1, 2, \ldots, k\}$, let $h_i = \mathcal{A}'_a(t/(k + 2), 4\delta)$, each time running $\mathcal{A}'_a$ on a different one of these
subsequences, rather than on the full sequence. From one of the remaining two subsequences, we
request the labels of the first $t/(k + 2)$ unlabeled examples and let $h_{k+1}$ denote any classifier in $\mathbb{C}$
consistent with these labels. From the remaining subsequence, for each $i, j \in \{1, 2, \ldots, k + 1\}$ s.t.
$\mathbb{P}(h_i(X) \neq h_j(X)) > 0$, we find the first $\lfloor t/((k + 2)(k + 1)k) \rfloor$ examples $x$ s.t. $h_i(x) \neq h_j(x)$,
request their labels and let $m_{ij}$ denote the number of mistakes made by $h_i$ on these labels (if
$\mathbb{P}(h_i(X) \neq h_j(X)) = 0$, we let $m_{ij} = 0$). Now take as the return value of $\mathcal{A}_a$ the classifier $h_{\hat{i}}$
where $\hat{i} = \arg\min_i \max_j m_{ij}$.

Suppose $t \geq \Lambda_a(\epsilon, \delta, h^*)$. First note that, by a Hoeffding bound argument (similar to the
proof of Theorem 3.7), $t$ is large enough to guarantee with probability $\geq 1 - \delta/2$ that $\mathrm{er}(h_{\hat{i}}) \leq
2 \min_i \mathrm{er}(h_i)$. So all that remains is to show that, with probability $\geq 1 - \delta/2$, at least one of
these $h_i$ has $\mathrm{er}(h_i) \leq \epsilon/2$.

If $\Lambda'_a(\epsilon/2, 4\delta, h^*)$ exceeds the other term in the minimum in the lemma statement, then the classic
results for consistent classifiers (e.g., [Blumer et al., 1989, Devroye et al., 1996, Vapnik, 1982])
guarantee that, with probability $\geq 1 - \delta/2$, $\mathrm{er}(h_{k+1}) \leq \epsilon/2$. Otherwise, we have
$t \geq (k + 2)\Lambda'_a(\epsilon/2, 4\delta, h^*)$. In this case, each of $h_1, \ldots, h_k$ has an independent $\geq 1 - 4\delta$
probability of having $\mathrm{er}(h_i) \leq \epsilon/2$. The probability at least one of them achieves this is therefore
at least $1 - (4\delta)^k \geq 1 - \delta/2$. □
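The last inequality in this proof, $(4\delta)^k \leq \delta/2$, follows directly from the choice of $k$ in the lemma statement. The short Python sketch below checks this numerically for a few values of $\delta$. The function names are hypothetical and are used only for illustration.

```python
import math


def num_repetitions(delta: float) -> int:
    """Return k = ceil(log(delta/2) / log(4*delta)) for delta in (0, 1/4)."""
    if not 0.0 < delta < 0.25:
        raise ValueError("delta must lie in (0, 1/4)")
    return math.ceil(math.log(delta / 2.0) / math.log(4.0 * delta))


def all_fail_probability(delta: float, k: int) -> float:
    """Upper bound on the probability that k independent runs all fail,
    when each run fails with probability at most 4*delta."""
    return (4.0 * delta) ** k


if __name__ == "__main__":
    for delta in (0.2, 0.1, 0.01):
        k = num_repetitions(delta)
        # The bound (4*delta)^k should never exceed delta/2.
        print(delta, k, all_fail_probability(delta, k) <= delta / 2.0)
```

Since $\log(4\delta) < 0$ on this range, $k \geq \log(\delta/2)/\log(4\delta)$ is equivalent to $k \log(4\delta) \leq \log(\delta/2)$, which is the inequality being checked.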


We are now ready to combine these lemmas to prove Theorem 3.5.

Proof of Theorem 3.5. Theorem 3.5 now follows by a simple combination of Lemmas 3.12 and 3.13,
along with Theorem 3.7 and Lemma 3.14. That is, the passive learning algorithm achieving
passive learning label complexity $\Lambda_p(\epsilon, \delta, h)$ on $(\mathbb{C}, \mathcal{D})$ also achieves passive label complexity
$\bar{\Lambda}_p(\epsilon, \delta, h) = \min_{\epsilon' \leq \epsilon} \lceil \Lambda_p(\epsilon', \delta, h) \rceil$ on any $(\mathbb{C}_i, \mathcal{D})$, where $\mathbb{C}_1, \mathbb{C}_2, \ldots$ is the decomposition from
Lemma 3.13. So Lemma 3.12 guarantees the existence of active learning algorithms $\mathcal{A}_1, \mathcal{A}_2, \ldots$
such that $\mathcal{A}_i$ achieves a label complexity $\Lambda_i(\epsilon, 2\delta, h) = o(\bar{\Lambda}_p(\epsilon, \delta, h))$ on $(\mathbb{C}_i, \mathcal{D})$ for all $\delta > 0$
and $h \in \mathbb{C}_i$ s.t. $\bar{\Lambda}_p(\epsilon, \delta, h)$ is finite and $\omega(1)$. Then Theorem 3.7 tells us that this implies the
existence of an active learning algorithm based on these $\mathcal{A}_i$ combined with Algorithm 4, achieving
label complexity $\Lambda'_a(\epsilon, 4\delta, h) = o(\bar{\Lambda}_p(\epsilon/2, \delta, h))$ on $(\mathbb{C}, \mathcal{D})$, for any $\delta > 0$ and $h$ s.t. $\bar{\Lambda}_p(\epsilon/2, \delta, h)$
is always finite and is $\omega(1)$. Lemma 3.14 then implies the existence of an algorithm achieving
label complexity $\Lambda_a(\epsilon, \delta, h) \in O(\min\{\Lambda'_a(\epsilon/2, 4\delta, h), \log(1/\epsilon)/\epsilon\}) \subseteq o(\bar{\Lambda}_p(\epsilon/4, \delta, h)) \subseteq
o(\Lambda_p(\epsilon/4, \delta, h))$ for all $\delta \in (0, 1/4)$ and all $h \in \mathbb{C}$. □

Note there is nothing special about 4 in Theorem 3.5. Using a similar argument, it can be made
arbitrarily close to 1.
Chapter  4 


Activized  Learning:  Transforming  Passive 
to  Active  With  Improved  Label  Complexity 


In  this  chapter,  we  prove  that,  in  the  realizable  case,  virtually  any  passive  learning  algorithm  can 
be  transformed  into  an  active  learning  algorithm  with  asymptotically  strictly  superior  label  com¬ 
plexity,  in  many  cases  without  significant  loss  in  computational  efficiency.  We  further  explore 
the  problem  of  learning  with  label  noise,  and  find  that  even  under  arbitrary  noise  distributions, 
we  can  still  guarantee  strict  improvements  over  the  known  results  for  passive  learning.  These  are 
the  most  general  results  proven  to  date  regarding  the  advantages  of  active  learning  over  passive 
learning. 


4.1  Definitions  and  Notation 


As in previous chapters, all of our asymptotic notation in this chapter will be interpreted as
$\epsilon \searrow 0$, when stated for a function of $\epsilon$, the desired excess error, or as $n \to \infty$ when stated for
a function of $n$, the allowed number of label requests. In particular, recall that for two functions
$\phi_1$ and $\phi_2$, we say $\phi_1(\epsilon) = o(\phi_2(\epsilon))$ iff $\lim_{\epsilon \searrow 0} \phi_1(\epsilon)/\phi_2(\epsilon) = 0$. Throughout the chapter, the $o$ notation, as
well as “$O$,” “$\Omega$,” “$\omega$,” “$\ll$,” and “$\gg$,” where used, should be interpreted purely in terms of the
asymptotic dependence on $\epsilon$ or $n$, with all other quantities held constant, including $\mathcal{D}_{XY}$, $\delta$, and
$\mathbb{C}$, where appropriate.


Definition 4.1. Define the set of functions polynomial in the logarithm of $1/\epsilon$ as follows.

$$\mathrm{Polylog}(1/\epsilon) = \{\phi : [0, 1] \to [0, \infty] \mid \exists k \in [0, \infty) \text{ s.t. } \phi(\epsilon) = O(\log^k(1/\epsilon))\}.$$


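Definition 4.2. We say an active meta-algorithm $\mathcal{A}_a$ activizes a passive algorithm $\mathcal{A}_p$ for $\mathbb{C}$
under $\mathbb{D}$ if for any label complexity $\Lambda_p$ achieved by $\mathcal{A}_p$, $\mathcal{A}_a(\mathcal{A}_p, \cdot)$ achieves label complexity $\Lambda_a$
such that for all $\mathcal{D} \in \mathbb{D}$,

$\Lambda_p(\epsilon + \nu(\mathbb{C}, \mathcal{D}), \mathcal{D}) \in \mathrm{Polylog}(1/\epsilon) \Rightarrow \Lambda_a(\epsilon + \nu(\mathbb{C}, \mathcal{D}), \mathcal{D}) \in \mathrm{Polylog}(1/\epsilon)$, and if
$\Lambda_p(\epsilon + \nu(\mathbb{C}, \mathcal{D}), \mathcal{D}) < \infty$ and $\Lambda_p(\epsilon + \nu(\mathbb{C}, \mathcal{D}), \mathcal{D}) \notin \mathrm{Polylog}(1/\epsilon)$, then there exists a finite
constant $c$ such that

$$\Lambda_a(c\epsilon + \nu(\mathbb{C}, \mathcal{D}), \mathcal{D}) = o(\Lambda_p(\epsilon + \nu(\mathbb{C}, \mathcal{D}), \mathcal{D})).$$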

Note that, in keeping with the reductions spirit, we only require the meta-algorithm to
successfully improve over the passive algorithm under conditions for which the passive algorithm
is itself a reasonable learning algorithm ($\Lambda_p \ll \infty$). Given a meta-algorithm satisfying this
condition, it is a trivial matter to strengthen it to successfully improve over the passive algorithm
even when the passive algorithm is not itself a reasonable method, simply by replacing the
passive algorithm with an aggregate of the passive algorithm and some reasonable general-purpose
method, such as empirical error minimization. For simplicity, we do not discuss this matter
further.

We will generally refer to any meta-algorithm $\mathcal{A}_a$ that activizes every passive algorithm $\mathcal{A}_p$
for $\mathbb{C}$ under $\mathbb{D}$ as a general activizer for $\mathbb{C}$ under $\mathbb{D}$. As we will see, such general activizers do
exist under Realizable($\mathbb{C}$), under mild conditions on $\mathbb{C}$. However, we will also see that this is
typically not true for the noisy settings.
4.2  A  Basic  Activizer 


In the following, we adopt the convention that any set of classifiers $V$ shatters $\{\}$ iff $V \neq \{\}$ (and
otherwise, shattering is defined as in [Vapnik, 1998], as usual). Furthermore, for convenience,
we will define $\mathcal{X}^0 = \{\{\}\}$.

Let us begin by motivating the approach we will take below. Similarly to Chapter 3, define the
boundary as $\partial_{\mathbb{C}} \mathcal{D}_{XY} = \lim_{r \searrow 0} \mathrm{DIS}(\mathbb{C}(r))$. If $\mathbb{P}(\partial_{\mathbb{C}} \mathcal{D}_{XY}) = 0$, then methods based on sampling in
the region of disagreement and inferring the labels of examples not in the region of disagreement
should be effective for activizing (in the realizable case). On the other hand, if $\mathbb{P}(\partial_{\mathbb{C}} \mathcal{D}_{XY}) > 0$,
then such methods will fail to focus the sampling region beyond a constant fraction of $\mathcal{X}$, so
alternative methods are needed. To cope with such situations, we might exploit the fact that the
region of disagreement of the set of classifiers with relatively small empirical error rates on a
labeled sample (call this set $\mathbb{C}(\tau)$) converges to $\partial_{\mathbb{C}} \mathcal{D}_{XY}$ (up to measure-zero differences). So,
for a large enough labeled sample, a random point $x \in \mathrm{DIS}(\mathbb{C}(\tau))$ will probably be in the
boundary region. We can exploit this fact by using $x$ to split $\mathbb{C}(\tau)$ into two subsets: $V_+ =
\{h \in \mathbb{C}(\tau) : h(x) = +1\}$ and $V_- = \{h \in \mathbb{C}(\tau) : h(x) = -1\}$. Now, if $x \in \partial_{\mathbb{C}} \mathcal{D}_{XY}$,
then $\inf_{h \in V_+} \mathrm{er}(h) = \inf_{h \in V_-} \mathrm{er}(h) = \nu(\mathbb{C}, \mathcal{D}_{XY})$. So, for almost every point $x' \in \mathcal{X} \setminus \mathrm{DIS}(V_+)$,
we can infer a label for this point, which will agree with some classifier whose error rate is
arbitrarily close to $\nu(\mathbb{C}, \mathcal{D}_{XY})$, and similarly for $V_-$. In particular, in the realizable case, this
inferred label is the target function's label, and in the benign noise case, it is the Bayes optimal
classifier's label (when $\eta(x') \neq 1/2$). We can therefore infer the label of points not in the region
$\mathrm{DIS}(V_+) \cap \mathrm{DIS}(V_-)$, thus effectively reducing the region we must request labels in. Similarly,
this region converges to a region $\partial_{V_+} \mathcal{D}_{XY} \cap \partial_{V_-} \mathcal{D}_{XY}$. If this region has zero probability, then
sampling from $\mathrm{DIS}(V_+) \cap \mathrm{DIS}(V_-)$ effectively focuses the sampling distribution, as needed.
Otherwise, we can repeat this argument; for large enough sample sizes, a random point from
$\mathrm{DIS}(V_+) \cap \mathrm{DIS}(V_-)$ will likely be in $\partial_{V_+} \mathcal{D}_{XY} \cap \partial_{V_-} \mathcal{D}_{XY}$, and therefore splits $\mathbb{C}(\tau)$ into four
sets with $\nu(\mathbb{C}, \mathcal{D}_{XY})$ optimal error rates, and we can further focus the sampling region in this
way. We can repeat this process as needed until we get a partition of $\mathbb{C}(\tau)$ with a shrinking
intersection of regions of disagreement. Note that this argument can be written more concisely
in terms of shattering. That is, a point in $\mathrm{DIS}(\mathbb{C}(\tau))$ is simply a point that $\mathbb{C}(\tau)$ can shatter.
Similarly, a point $x' \in \mathrm{DIS}(V_+) \cap \mathrm{DIS}(V_-)$ is simply a point s.t. $\mathbb{C}(\tau)$ shatters $\{x, x'\}$, etc.
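To make the link between disagreement regions and shattering concrete, here is a minimal Python sketch for a finite set of classifiers. It is an illustration under simplifying assumptions, not the construction used in the proofs: the classifier set is assumed finite and explicitly enumerable, and the names `shatters` and `in_disagreement_region` are hypothetical.

```python
from typing import Callable, Hashable, Iterable, Sequence

# A classifier maps a point to a label in {-1, +1}.
Classifier = Callable[[Hashable], int]


def shatters(V: Iterable[Classifier], S: Sequence[Hashable]) -> bool:
    """Return True iff the finite classifier set V realizes all 2^|S| labelings of S.

    Uses the convention from the text: V shatters the empty set iff V is nonempty.
    """
    V = list(V)
    if len(S) == 0:
        return len(V) > 0
    realized = {tuple(h(x) for x in S) for h in V}
    return len(realized) == 2 ** len(S)


def in_disagreement_region(V: Iterable[Classifier], x: Hashable) -> bool:
    """A point is in DIS(V) exactly when V shatters the singleton {x}."""
    return shatters(V, [x])


if __name__ == "__main__":
    # Threshold classifiers on the integers: h_t(x) = +1 iff x >= t.
    thresholds = [lambda x, t=t: 1 if x >= t else -1 for t in range(5)]
    print(in_disagreement_region(thresholds, 2))  # classifiers disagree at 2
    print(shatters(thresholds, [1, 2]))           # thresholds cannot shatter a pair
```

With this view, the iterated splitting described above corresponds to asking whether the current set shatters $\{x\}$, then $\{x, x'\}$, and so on.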

The  above  simple  argument  leads  to  a  natural  algorithm,  which  effectively  improves  label 
complexity  for  confidence-bounded  error  in  the  realizable  case.  However,  to  achieve  improve¬ 
ments  in  the  label  complexity  for  expected  error,  it  is  not  sufficient  to  merely  have  the  probability 
of  a  random  point  in  DIS(C(r))  being  in  the  boundary  converging  to  1,  as  this  could  happen  at 
a  slow  rate.  To  resolve  this,  we  can  replace  the  single  sample  x  with  multiple  samples,  and  then 
take  a  majority  vote  over  whether  to  infer  the  label,  and  which  label  to  infer  if  we  do. 

The following meta-algorithm, based on these observations, is central to the results of this
chapter. It depends on several parameters, and two types of estimators: $\hat{\Delta}^{(k)}(\cdot, \cdot)$ and $\hat{\Gamma}^{(k)}(\cdot, \cdot, \cdot)$;
one possible definition for these is given immediately after the meta-algorithm, along with a
discussion of the roles of these various parameters and estimators.

Meta-Algorithm 5: Activizer($\mathcal{A}_p$, $n$)

Input: passive algorithm $\mathcal{A}_p$, label budget $n$

Output: classifier $\hat{h}$

0. Request the first $\lfloor n/3 \rfloor$ labels and let $Q$ denote these $\lfloor n/3 \rfloor$ labeled examples

1. Let $V = \{h \in \mathbb{C} : \mathrm{er}_Q(h) - \min_{h' \in \mathbb{C}} \mathrm{er}_Q(h') \leq \tau\}$

2. Let $U_1$ be the next $m_n$ unlabeled examples, and $U_2$ the next $m_n$ examples after that

3. For $k = 1, 2, \ldots, d + 1$

4.     Let $\mathcal{L}_k$ denote the next $\lfloor n/(6 \cdot 2^k \hat{\Delta}^{(k)}(U_1, U_2)) \rfloor$ unlabeled examples

5.     For each $x \in \mathcal{L}_k$,

6.         If $\hat{\Delta}^{(k)}(x, U_2) \geq 1 - \gamma$, and we've requested $< \lfloor n/(3 \cdot 2^k) \rfloor$ labels in $\mathcal{L}_k$ so far,

7.             Request the label of $x$ and replace it in $\mathcal{L}_k$ by the labeled one

8.         Else, label $x$ with $\arg\max_{y \in \{-1, +1\}} \hat{\Gamma}^{(k)}(x, y, U_2)$ and replace it in $\mathcal{L}_k$ by the labeled one

9. Return ActiveSelect($\{\mathcal{A}_p(\mathcal{L}_1), \mathcal{A}_p(\mathcal{L}_2), \ldots, \mathcal{A}_p(\mathcal{L}_{d+1})\}$, $\lfloor n/3 \rfloor$)

Subroutine: ActiveSelect($\{h_1, h_2, \ldots, h_N\}$, $m$)

0. For each $j, k \in \{1, 2, \ldots, N\} : j < k$,

1.     Take the next $\lfloor m/\binom{N}{2} \rfloor$ examples $x$ s.t. $h_j(x) \neq h_k(x)$ (if such examples exist)

2.     Let $m_{jk}$ and $m_{kj}$ respectively denote the number of mistakes $h_j$ and $h_k$ make on these

3. Return $h_{\hat{k}}$, where $\hat{k} = \arg\min_k \max_j m_{kj}$
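The ActiveSelect subroutine can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than a definitive implementation: the unlabeled data are supplied by a hypothetical callable `draw_unlabeled`, labels come from a hypothetical oracle `request_label`, and a `max_scan` cap (not part of the original subroutine) stops the search for disagreement points when a pair of classifiers rarely or never disagrees.

```python
from itertools import combinations
from math import comb
from typing import Any, Callable, List, Sequence

Classifier = Callable[[Any], int]


def active_select(
    classifiers: Sequence[Classifier],
    m: int,
    draw_unlabeled: Callable[[], Any],
    request_label: Callable[[Any], int],
    max_scan: int = 10_000,
) -> Classifier:
    """Pick a classifier by pairwise comparisons on points where the pair disagrees."""
    n_clf = len(classifiers)
    if n_clf == 0:
        raise ValueError("need at least one classifier")
    if n_clf == 1:
        return classifiers[0]

    per_pair = m // comb(n_clf, 2)  # label budget allotted to each pair
    # mistakes[k][j] counts the mistakes of classifier k on points where k and j disagree.
    mistakes: List[List[int]] = [[0] * n_clf for _ in range(n_clf)]

    for j, k in combinations(range(n_clf), 2):
        found = scanned = 0
        while found < per_pair and scanned < max_scan:
            x = draw_unlabeled()
            scanned += 1
            if classifiers[j](x) == classifiers[k](x):
                continue  # only disagreement points are labeled
            y = request_label(x)
            found += 1
            mistakes[j][k] += int(classifiers[j](x) != y)
            mistakes[k][j] += int(classifiers[k](x) != y)

    # Return the classifier whose worst pairwise mistake count is smallest.
    best = min(range(n_clf), key=lambda k: max(mistakes[k]))
    return classifiers[best]
```

Because each pair uses at most $\lfloor m/\binom{N}{2} \rfloor$ label requests, the total number of requests never exceeds $m$, matching the budget claim made for the subroutine.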
The meta-algorithm has several parameters to be specified below.

As with Algorithm 0 and the agnostic generalizations thereof, the set $V$ can be represented
implicitly by simply performing each step on the full space $\mathbb{C}$, subject to the constraint given in
the definition of $V$, so that we can more easily adapt algorithms that are designed to manipulate
$\mathbb{C}$. Note that, since this is the realizable case, the choice of $\tau = 0$ is sufficient, and furthermore
enables the possibility of an efficient reduction to the passive algorithm for many interesting
concept spaces. The choice of $\gamma$ is fairly arbitrary; generally, the proof requires only that $\gamma \in
(0, 1)$.

The design of the estimators $\hat{\Delta}^{(k)}(U_1, U_2)$, $\hat{\Delta}^{(k)}(x, U_2)$, and $\hat{\Gamma}^{(k)}(x, y, U_2)$ can be done in
a variety of ways. Generally, the only important feature seems to be that they be converging
estimators of appropriate limiting values. For our purposes, given any $m \in \mathbb{N}$ and sequences
$U_1 = \{z_1, \ldots, z_m\} \in \mathcal{X}^m$ and $U_2 = \{z_{m+1}, z_{m+2}, \ldots, z_{2m}\} \in \mathcal{X}^m$, the following definitions for
$\hat{\Delta}^{(k)}(U_1, U_2)$, $\hat{\Delta}^{(k)}(z, U_2)$, and $\hat{\Gamma}^{(k)}(x, y, U_2)$ will suffice. Generally, we define

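$$\hat{\Delta}^{(k)}(U_1, U_2) = \frac{1}{m^{1/3}} + \frac{1}{m} \sum_{z \in U_1} \mathbb{1}\left[\hat{\Delta}^{(k)}(z, U_2) \geq 1 - \gamma\right]. \quad (4.1)$$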

For the others, there are two cases to consider. If $k = 1$, the definitions are quite simple:

$$\hat{\Gamma}^{(1)}(x, y, U_2) = \mathbb{1}[\forall h \in V, h(x) = y],$$

$$\hat{\Delta}^{(1)}(z, U_2) = \mathbb{1}[z \in \mathrm{DIS}(V)].$$

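For the other case, namely $k \geq 2$, we first partition $U_2$ into subsets of size $k - 1$, and record
how many of those subsets are shattered by $V$: for $i \in \{1, 2, \ldots, \lfloor m/(k - 1) \rfloor\}$, define $S_i^{(k)} =
\{z_{m+1+(i-1)(k-1)}, \ldots, z_{m+i(k-1)}\}$, and let

$$M_k = \max\left\{1, \sum_{i=1}^{\lfloor m/(k-1) \rfloor} \mathbb{1}\left[V \text{ shatters } S_i^{(k)}\right]\right\}.$$

Then define $V_{(x,y)} = \{h \in V : h(x) = y\}$, and

$$\hat{\Gamma}^{(k)}(x, y, U_2) = \frac{1}{M_k} \sum_{i=1}^{\lfloor m/(k-1) \rfloor} \mathbb{1}\left[V \text{ shatters } S_i^{(k)} \text{ and } V_{(x,-y)} \text{ does not shatter } S_i^{(k)}\right]. \quad (4.2)$$

$\hat{\Delta}^{(k)}(z, U_2)$ simply estimates the probability that $S \cup \{z\}$ is shatterable by $V$ given $S$ shatterable
by $V$, as follows.

$$\hat{\Delta}^{(k)}(z, U_2) = \frac{1}{m^{1/3}} + \frac{1}{M_k} \sum_{i=1}^{\lfloor m/(k-1) \rfloor} \mathbb{1}\left[V \text{ shatters } S_i^{(k)} \cup \{z\}\right]. \quad (4.3)$$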
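As a rough illustration of how the label-inference estimator of (4.2) could be computed, the following Python sketch handles a finite, explicitly enumerated version space $V$. Several points are assumptions on my part rather than statements from the source: the restricted set in (4.2) is read as $V_{(x,-y)}$, the classifiers that label $x$ with $-y$; the sum is normalized by $M_k$; and $U_2$ is passed as a plain list that is cut into consecutive blocks of size $k - 1$. The function names are hypothetical.

```python
from typing import Any, Callable, List, Sequence

Classifier = Callable[[Any], int]


def shatters(V: Sequence[Classifier], S: Sequence[Any]) -> bool:
    """True iff V realizes all 2^|S| labelings of S (V shatters {} iff V is nonempty)."""
    if len(S) == 0:
        return len(V) > 0
    return len({tuple(h(x) for x in S) for h in V}) == 2 ** len(S)


def gamma_hat(V: Sequence[Classifier], x: Any, y: int, U2: Sequence[Any], k: int) -> float:
    """Sketch of the estimator in (4.2): evidence that label y can be inferred for x."""
    V = list(V)
    if k == 1:
        # k = 1: infer y only if every classifier in V already predicts y at x.
        return 1.0 if all(h(x) == y for h in V) else 0.0

    size = k - 1
    blocks: List[Sequence[Any]] = [U2[i * size:(i + 1) * size] for i in range(len(U2) // size)]
    shattered = [S for S in blocks if shatters(V, S)]
    m_k = max(1, len(shattered))

    # Classifiers that assign the opposite label -y to x.
    v_opposite = [h for h in V if h(x) == -y]
    # Count blocks shattered by V that the opposite-label classifiers fail to shatter.
    hits = sum(1 for S in shattered if not shatters(v_opposite, S))
    return hits / m_k
```

A value near 1 for some $y$ indicates that forcing the label $-y$ at $x$ would destroy the ability to shatter typical shatterable sets, which is the situation in which Meta-Algorithm 5 infers the label $y$ instead of requesting it.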

The following theorem is the main result on activized learning in the realizable case for this
chapter.

Theorem 4.3. Suppose $\mathbb{C}$ is a VC class, $0 \leq \tau = o(1)$, $m_n \geq n$, and $\gamma \in (0, 1)$ is constant. Let
$\hat{\Delta}^{(k)}$ and $\hat{\Gamma}^{(k)}$ be defined as in (4.1), (4.3), and (4.2).
For any passive algorithm $\mathcal{A}_p$, Meta-Algorithm 5 activizes $\mathcal{A}_p$ for $\mathbb{C}$ under Realizable($\mathbb{C}$).

More concisely, Theorem 4.3 states that Meta-Algorithm 5 is a general activizer for $\mathbb{C}$. We
can also prove the following result on the fixed-confidence version of label complexity.¹

Theorem 4.4. Suppose the conditions of Theorem 4.3 hold, and that $\mathcal{A}_p$ achieves a label
complexity $\Lambda_p$. Then Activizer($\mathcal{A}_p$, $\cdot$) achieves a label complexity $\Lambda_a$ such that, for any
$\delta \in (0, 1)$ and $\mathcal{D} \in$ Realizable($\mathbb{C}$), there is a finite constant $c$ such that

$\Lambda_p(\epsilon, c\delta, \mathcal{D}) = O(1) \Rightarrow \Lambda_a(c\epsilon, c\delta, \mathcal{D}) = O(1)$ and

$\Lambda_p(\epsilon, \delta, \mathcal{D}) = \omega(1) \Rightarrow \Lambda_a(c\epsilon, c\delta, \mathcal{D}) = o(\Lambda_p(\epsilon, \delta, \mathcal{D}))$.

The proofs of Theorems 4.3 and 4.4 are deferred to Section 4.4.

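For a more concrete implication, we immediately get the following simple corollary.

Corollary 4.5. For any VC class $\mathbb{C}$, there exist active learning algorithms that achieve label
complexities $\Lambda_a(\epsilon, \mathcal{D}_{XY})$ and $\Lambda_a(\epsilon, \delta, \mathcal{D}_{XY})$, respectively, such that for all $\mathcal{D}_{XY} \in$ Realizable($\mathbb{C}$),

$\Lambda_a(\epsilon, \mathcal{D}_{XY}) = o(1/\epsilon)$, and $\forall \delta \in (0, 1)$, $\Lambda_a(\epsilon, \delta, \mathcal{D}_{XY}) = o(1/\epsilon)$.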


Proof. For $d = 0$, the result is trivial. For $d \geq 1$, Haussler, Littlestone, and Warmuth [1994]
propose passive learning algorithms achieving respective label complexities $\Lambda_p(\epsilon, \mathcal{D}_{XY})$
and $\Lambda_p(\epsilon, \delta, \mathcal{D}_{XY})$ that are $O(1/\epsilon)$. Plugging this into Theorems 4.3 and 4.4 implies that applying
Meta-Algorithm 5 to these passive algorithms yields combined active learning algorithms with
the stated behaviors for $\Lambda_a(\epsilon, \mathcal{D}_{XY})$ and $\Lambda_a(\epsilon, \delta, \mathcal{D}_{XY})$. □


¹ In fact, this result even holds for a much simpler variant of the algorithm, where $\hat{\Gamma}^{(k)}$ and $\hat{\Delta}^{(k)}$ can be replaced
by an estimator that uses a single random $S \in \mathcal{X}^{k-1}$ shattered by $V$, rather than repeated samples.
For practical reasons, it is interesting to note that all of the label requests in Meta-Algorithm
5 can be performed in three batches: the initial $n/3$, the requests during the $d + 1$ iterations (which
can all be requested in a single batch), and the requests for the ActiveSelect procedure. However,
because of this, we should not expect Meta-Algorithm 5 to have optimal label complexities. In
particular, to get exponential rates, we should expect to need $\Theta(n)$ batches. That said, it should
be possible to construct the sets $\mathcal{L}_k$ sequentially, updating $V$ after each example added to $\mathcal{L}_k$, and
requesting labels as needed while constructing the set, analogous to Algorithm 0. Some care in
the choice of stopping criterion on each round is needed to make sure the set $\mathcal{L}_k$ still represents an
i.i.d. sample. Such a modification should significantly improve the label complexities compared
to Meta-Algorithm 5, while still maintaining the validity of the results proven here.

Note: The restriction to VC classes is not necessary for positive results in activized learning.
For instance, even if the concept space $\mathbb{C}$ has infinite VC dimension, but can be decomposed
into a countable sequence of VC class subsets, we can still construct an activizer for $\mathbb{C}$ using an
aggregation technique similar to that introduced in Chapter 3.

4.3  Toward  Agnostic  Activized  Learning 

We might wonder whether it is possible to state a result as general as Theorem 4.3 even for the
most general setting Agnostic. However, one can construct VC classes $\mathbb{C}$, and passive algorithms
$\mathcal{A}_p$ that cannot be activized for $\mathbb{C}$, even under bounded noise distributions (Tsybakov($\mathbb{C}$, 1, $\mu$)),
let alone Agnostic. These algorithms tend to have a peculiar dependence on the noise
distribution, so that if the noise distribution and $h^*$ align in just the right way, the algorithm becomes
very good, and is otherwise not very good; the effect is that we cannot lose much information
about the noise distribution if we hope to get these extremely fast rates for these particular
distributions, so that the problem becomes more like regression than classification. However, as
mentioned, these passive algorithms are not very interesting for most distributions, which leads
to an informal conjecture that any reasonable passive algorithm can be activized for $\mathbb{C}$ under
Agnostic. More formally, I have the following specific conjecture.

Recall that we say $h$ is a minimizer of the empirical error rate for a labeled sample $\mathcal{L}$ iff
$h \in \arg\min_{h' \in \mathbb{C}} \mathrm{er}_{\mathcal{L}}(h')$.

Conjecture 4.6. For any VC class $\mathbb{C}$, there exists a passive algorithm $\mathcal{A}_p$ that outputs a
minimizer of the empirical error rate on its training sample such that some active
meta-algorithm $\mathcal{A}_a$ activizes $\mathcal{A}_p$ for $\mathbb{C}$ under Agnostic.

Although, at this writing, this conjecture remains open, the rest of this section may serve as
evidence in its favor.

4.3.1  Positive  Results 

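First, we have the following simple lemma, which allows us to restrict the discussion to the
BenignNoise($\mathbb{C}$) case.

Lemma 4.7. For any $\mathbb{C}$, if there exists an active algorithm $\mathcal{A}_a$ achieving label complexities $\Lambda_a(\epsilon, \mathcal{D})$
and $\Lambda_a(\epsilon, \delta, \mathcal{D})$, then there exists an active algorithm $\mathcal{A}'_a$ achieving label complexities $\Lambda'_a(\epsilon, \mathcal{D})$ and $\Lambda'_a(\epsilon, \delta, \mathcal{D})$ such
that, $\forall \mathcal{D} \in$ Agnostic and $\delta \in (0, 1)$, for some functions $\lambda(\epsilon, \mathcal{D}), \lambda(\epsilon, \delta, \mathcal{D}) \in \mathrm{Polylog}(1/\epsilon)$,
if $\mathcal{D} \in$ BenignNoise($\mathbb{C}$), then

$$\Lambda'_a(\epsilon + \nu(\mathbb{C}, \mathcal{D}), \mathcal{D}) \leq \max\{2 \lceil \Lambda_a(\epsilon/2 + \nu(\mathbb{C}, \mathcal{D}), \mathcal{D}) \rceil, \lambda(\epsilon, \mathcal{D})\},$$

$$\Lambda'_a(\epsilon + \nu(\mathbb{C}, \mathcal{D}), \delta, \mathcal{D}) \leq \max\{2 \lceil \Lambda_a(\epsilon + \nu(\mathbb{C}, \mathcal{D}), \delta/2, \mathcal{D}) \rceil, \lambda(\epsilon, \delta, \mathcal{D})\},$$

and if $\mathcal{D} \notin$ BenignNoise($\mathbb{C}$), then

$$\Lambda'_a(\epsilon + \nu(\mathbb{C}, \mathcal{D}), \mathcal{D}) \leq \lambda(\epsilon, \mathcal{D}),$$

$$\Lambda'_a(\epsilon + \nu(\mathbb{C}, \mathcal{D}), \delta, \mathcal{D}) \leq \lambda(\epsilon, \delta, \mathcal{D}).$$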

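Proof. Consider a universally consistent passive learning algorithm $\mathcal{A}_u$. Then $\mathcal{A}_u$ achieves label
complexities $\Lambda_u(\epsilon, \mathcal{D})$ and $\Lambda_u(\epsilon, \delta, \mathcal{D})$ such that for any distribution $\mathcal{D}$ on $\mathcal{X} \times \{-1, +1\}$, $\forall \epsilon, \delta \in (0, 1)$,
$\Lambda_u(\epsilon/2 + \beta(\mathcal{D}), \mathcal{D})$ and $\Lambda_u(\epsilon/2 + \beta(\mathcal{D}), \delta/2, \mathcal{D})$ are both finite. In particular, if $\beta(\mathcal{D}) < \nu(\mathbb{C}, \mathcal{D})$,
then $\Lambda_u(\epsilon/2 + \nu(\mathbb{C}, \mathcal{D}), \mathcal{D}) = O(1)$ and $\Lambda_u(\epsilon/2 + \nu(\mathbb{C}, \mathcal{D}), \delta/2, \mathcal{D}) = O(1)$.

Now we simply run $\mathcal{A}_a(\lfloor n/2 \rfloor)$, to get a classifier $h_a$, and run $\mathcal{A}_u(Z_{\lfloor n/3 \rfloor})$ (after requesting
those first $\lfloor n/3 \rfloor$ labels), to get a classifier $h_u$. Take the next $n - \lfloor n/2 \rfloor - \lfloor n/3 \rfloor$ unlabeled
examples and request their labels; call this set $\mathcal{L}$. If $\mathrm{er}_{\mathcal{L}}(h_a) - \mathrm{er}_{\mathcal{L}}(h_u) > n^{-1/3}$, return $\hat{h} = h_u$;
otherwise, return $\hat{h} = h_a$. I claim that this method achieves the stated result, for the following
reasons.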
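The budget split and the final comparison step of this construction can be sketched in Python as follows. This is only an illustration of the selection rule described above: the training of $h_a$ and $h_u$ is left out, and the function names `split_budget`, `empirical_error`, and `choose_classifier` are hypothetical.

```python
from typing import Any, Callable, Sequence, Tuple

Classifier = Callable[[Any], int]


def split_budget(n: int) -> Tuple[int, int, int]:
    """Split n label requests: floor(n/2) for the active learner, floor(n/3) for the
    universally consistent passive learner, and the remainder for the holdout set."""
    n_active = n // 2
    n_passive = n // 3
    return n_active, n_passive, n - n_active - n_passive


def empirical_error(h: Classifier, labeled: Sequence[Tuple[Any, int]]) -> float:
    """Fraction of labeled examples (x, y) on which h(x) differs from y."""
    return sum(1 for x, y in labeled if h(x) != y) / len(labeled)


def choose_classifier(
    h_a: Classifier,
    h_u: Classifier,
    holdout: Sequence[Tuple[Any, int]],
    n: int,
) -> Classifier:
    """Return h_u only if it beats h_a on the holdout set by more than n^(-1/3)."""
    gap = empirical_error(h_a, holdout) - empirical_error(h_u, holdout)
    return h_u if gap > n ** (-1.0 / 3.0) else h_a
```

The asymmetric threshold means the active learner's output is kept by default, and the passive fallback is used only when the holdout set gives clear evidence that it is better.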

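First, let us examine the final step of this algorithm. By Hoeffding's inequality, the probability
that $\mathrm{er}(\hat{h}) \neq \min\{\mathrm{er}(h_a), \mathrm{er}(h_u)\}$ is at most $2\exp\{-n^{1/3}/24\}$.

Consider the case where $\mathcal{D} \in$ BenignNoise($\mathbb{C}$). For any $n \geq 2\lceil \Lambda_a(\epsilon/2 + \nu(\mathbb{C}, \mathcal{D}), \mathcal{D}) \rceil$,
$\mathbb{E}[\mathrm{er}(h_a)] \leq \nu(\mathbb{C}, \mathcal{D}) + \epsilon/2$, so $\mathbb{E}[\mathrm{er}(\hat{h})] \leq \nu(\mathbb{C}, \mathcal{D}) + \epsilon/2 + 2\exp\{-n^{1/3}/24\}$, which is at most
$\nu(\mathbb{C}, \mathcal{D}) + \epsilon$ if $n \geq 24^3 \ln^3 \frac{4}{\epsilon}$. Also, for any $n \geq 2\lceil \Lambda_a(\epsilon + \nu(\mathbb{C}, \mathcal{D}), \delta/2, \mathcal{D}) \rceil$, with probability at
least $1 - \delta/2$, $\mathrm{er}(h_a) \leq \nu(\mathbb{C}, \mathcal{D}) + \epsilon$. If additionally, $n \geq 24^3 \ln^3 \frac{4}{\delta}$, then a union bound implies
that with probability $\geq 1 - \delta$, $\mathrm{er}(\hat{h}) \leq \mathrm{er}(h_a) \leq \nu(\mathbb{C}, \mathcal{D}) + \epsilon$.

On the other hand, if $\mathcal{D} \notin$ BenignNoise($\mathbb{C}$), then for any $n \geq 3\lceil \Lambda_u(\nu(\mathbb{C}, \mathcal{D}) + \epsilon/2, \mathcal{D}) \rceil$,
$\mathbb{E}[\mathrm{er}(\hat{h})] \leq \mathbb{E}[\min\{\mathrm{er}(h_a), \mathrm{er}(h_u)\}] + 2\exp\{-n^{1/3}/24\} \leq \mathbb{E}[\mathrm{er}(h_u)] + 2\exp\{-n^{1/3}/24\} \leq
\nu(\mathbb{C}, \mathcal{D}) + \epsilon/2 + 2\exp\{-n^{1/3}/24\}$. Again, this is at most $\nu(\mathbb{C}, \mathcal{D}) + \epsilon$ if $n \geq 24^3 \ln^3 \frac{4}{\epsilon}$. Similarly,
for any $n \geq 3\lceil \Lambda_u(\nu(\mathbb{C}, \mathcal{D}) + \epsilon, \delta/2, \mathcal{D}) \rceil = O(1)$, with probability $\geq 1 - \delta/2$, $\mathrm{er}(h_u) \leq \nu(\mathbb{C}, \mathcal{D}) +
\epsilon$. If additionally, $n \geq 24^3 \ln^3 \frac{4}{\delta}$, then a union bound implies that with probability $\geq 1 - \delta$,
$\mathrm{er}(\hat{h}) \leq \mathrm{er}(h_u) \leq \nu(\mathbb{C}, \mathcal{D}) + \epsilon$.

Thus, we can take $\lambda(\epsilon, \mathcal{D}) = \max\{24^3 \ln^3 \frac{4}{\epsilon}, 3\lceil \Lambda_u(\nu(\mathbb{C}, \mathcal{D}) + \epsilon/2, \mathcal{D}) \rceil\} \in \mathrm{Polylog}(1/\epsilon)$,
and $\lambda(\epsilon, \delta, \mathcal{D}) = \max\{24^3 \ln^3 \frac{4}{\delta}, 3\lceil \Lambda_u(\nu(\mathbb{C}, \mathcal{D}) + \epsilon, \delta/2, \mathcal{D}) \rceil\} \in \mathrm{Polylog}(1/\epsilon)$. □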

Because of Lemma 4.7, it suffices to focus our discussion purely on the BenignNoise($\mathbb{C}$)
case, since any label complexity results for BenignNoise($\mathbb{C}$) immediately imply almost equally
strong label complexity results for Agnostic, losing only an additive polylogarithmic term. With
this in mind, we state the following active learning algorithm, designed for the BenignNoise($\mathbb{C}$)
setting.
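Meta-Algorithm 6: BenignActivizer($\mathcal{A}_p$, $n$)

Input: passive algorithm $\mathcal{A}_p$, label budget $n$
Output: classifier $\hat{h}$

0. Request the first $\lfloor n/3 \rfloor$ labels and let $Q$ denote these $\lfloor n/3 \rfloor$ labeled examples

1. Let $V = \{h \in \mathbb{C} : \mathrm{er}_Q(h) - \min_{h' \in \mathbb{C}} \mathrm{er}_Q(h') \leq \tau\}$

2. Let $U_2$ be the next $m_n$ unlabeled examples

3. For $k = 1, 2, \ldots, d$

4.     $Q_k \leftarrow \{\}$

5.     For $t = 1, 2, \ldots, \lfloor 2n/(3 \cdot 2^k) \rfloor$

6.         Let $x'$ be the next unlabeled example for which $\min_{j \leq k} \hat{\Delta}^{(j)}(x', U_2) \geq 1 - \gamma$

7.         Request the label $y'$ of $x'$ and let $Q_k \leftarrow Q_k \cup \{(x', y')\}$

8. Construct the classifier $\hat{h}_k$, for $k \in \{1, 2, \ldots, d + 1\}$ (see description below)

9. Return $\hat{h}_{\hat{k}}$ for $\hat{k} = \max\{k : \max_{j < k} \mathrm{er}_{Q_j}(\hat{h}_k) - \mathrm{er}_{Q_j}(\hat{h}_j) \leq T_{kj}\}$

The definition of $\hat{h}_k$ in Step 8 of Meta-Algorithm 6 is as follows.

Let $h_k = \mathcal{A}_p(Q_k)$, $k'(x) = \min\{k' : \hat{\Delta}^{(k')}(x, U_2) < 1 - \gamma\}$, and

$$\hat{h}_k(x) = \begin{cases} \arg\max_{y \in \{-1, +1\}} \hat{\Gamma}^{(k'(x))}(x, y, U_2), & \text{if } k'(x) \leq k \\ h_k(x), & \text{otherwise.} \end{cases}$$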

For  the  threshold  Tkj  in  Step  9  of  Meta-Algorithm  6,  for  our  purposes,  we  can  take  the 
following  definition. 


Tkj  —  5-i 


'2048dln(1024d)  +  ln(32(d  +  1  )/5) 


I  Qk 


It  is  interesting  to  note  that  this  algorithm  requires  only  two  batches  of  label  requests,  which 
is  clearly  the  minimum  number  for  any  algorithm  that  takes  advantage  of  the  sequential  aspects 
of  active  learning.  However,  even  with  this,  we  have  the  following  general  results. 


Theorem 4.8. Let $\tau$ = ^ 4- 7 y 1,1 ' !l‘ d , $\delta \in (0, 1)$, and let $\hat{\Delta}^{(k)}$ and $\hat{\Gamma}^{(k)}$ be defined as
in (4.1), (4.3), and (4.2). For any VC class $\mathbb{C}$, by applying Meta-Algorithm 6 with $\mathcal{A}_p$ being any
algorithm outputting a minimizer of the empirical error rate from $\mathbb{C}$, the combined active
algorithm achieves a label complexity $\Lambda_a$ such that $\forall \mathcal{D} \in$ BenignNoise($\mathbb{C}$),

$$\Lambda_a(\epsilon + \nu(\mathbb{C}, \mathcal{D}), \delta, \mathcal{D}) = o(1/\epsilon^2).$$

The proof of Theorem 4.8 is included in Section 4.4.1. Theorem 4.8, combined with Lemma 4.7,
immediately implies the following quite general corollary.

Corollary 4.9. For any VC class $\mathbb{C}$, and $\delta \in (0, 1)$, there exists an active learning algorithm
achieving a label complexity $\Lambda_a$ such that, $\forall \mathcal{D} \in$ Agnostic,

$$\Lambda_a(\epsilon + \nu(\mathbb{C}, \mathcal{D}), \delta, \mathcal{D}) = o(1/\epsilon^2).$$

Note that this result shows strict improvements over the known worst-case (minimax) label
complexities for passive learning.


4.4  Proofs 

4.4.1 Proof of Theorems 4.3, 4.4, and 4.8

Throughout this subsection, we will assume $\mathbb{C}$ is a VC class, $0 \leq \tau = o(1)$, $m_n \geq n$, $\gamma \in (0, 1)$,
and $\hat{\Delta}^{(k)}$ and $\hat{\Gamma}^{(k)}$ are defined as in (4.1), (4.3), and (4.2), as stated in the conditions of the
theorems. Furthermore, we will define $V = \{h \in \mathbb{C} : \mathrm{er}_{\lfloor n/3 \rfloor}(h) - \min_{h' \in \mathbb{C}} \mathrm{er}_{\lfloor n/3 \rfloor}(h') \leq \tau\}$, and
unless otherwise specified, $\mathcal{D}_{XY} \in$ Agnostic and we will simply discuss the behavior for this
fixed, but arbitrary, distribution.

Also, recall that we are using the convention that $\mathcal{X}^0 = \{\{\}\}$ and we say a set of classifiers
$V$ shatters $\{\}$ iff $V \neq \{\}$.

Lemma  4.10.  For  any  N  6  N,  and  N  classifiers  {hi,  hn-,  ■  ■  ■ ,  h^}, 

ActiveS elect ({h\,  hi, . . . ,  hjy},  m)  makes  at  most  m  label  requests,  and  ifh~k  is  the  classifier 
output  by  ActiveS elect ({h\,  hi, ... ,  h^{,  m),  then  with  probability 
>  1  -  2 (N  -  l)exp{  —  (m/ (^))/72},  er(hk )  <  2  min ker(hk). 

Proof.  This  proof  is  essentially  identical  to  the  proof  of  Theorem  13 .71  from  Chapter|3j 

First  note  that  the  total  number  of  label  requests  used  by  ActiveSelect  is  at  most  m,  since 
each  pair  of  classifiers  uses  at  most  m/  (^)  requests. 
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Let  k**  =  argminfc  er(hk).  Now  for  any  j  G  {1,2,...,  N}  with  P (hj(X)  f  hk**(X ))  >  0, 
the  law  of  large  numbers  implies  that  with  probability  1  we  will  find  at  least  m/(^)  exam¬ 
ples  remaining  in  the  sequence  for  which  hj(x)  f  hk**  (x),  and  furthermore  since  er(hk**  \{x  : 
hj(x )  f  hk**(x )})  <  1/2,  Hoeffding’s  inequality  implies  that  F(mk**j  >  (7/12 )m/{N2))  < 
exp{  —  (m/ (^))/72).  A  union  bound  implies 


P 


( 


ma xmk**j  >  ( 7/12)m / 

V  i 


<  (N  —  1  )exp  <  —  I  m/ 


/72 


Now  suppose  k  G  {1,  2, . . . ,  N}  has  er(hk)  >  2er{hk**).  In  particular,  this  implies  F(hk(X) 
hk**(X))  >  0  and  er(hk\{x  :  hk**(x )  f  hk(x)})  >  2/3.  By  Hoeffding’s  inequality,  we 
have  that  P (mkk**  <  ( 7/12)m / (vf) )  <  exp{— (m/ (^))/72|.  By  a  union  bound,  we  have  that 
P(3 k  :  er(hk )  >  2 er(hk**)  and  maxj  <  (7/12)m/((/))  <  (N  —  l)exp{  —  (m/(J/))/72). 

So, by a union bound, with probability $\geq 1 - 2(N-1)\exp\{-(m/\binom{N}{2})/72\}$, for the $\hat{k}$ chosen
by ActiveSelect,

\[
\max_j m_{\hat{k}j} \leq \max_j m_{k^{**}j} \leq (7/12)\,m/\tbinom{N}{2} < \min_{k : er(h_k) > 2er(h_{k^{**}})} \max_j m_{kj},
\]

and thus $er(h_{\hat{k}}) \leq 2er(h_{k^{**}})$ as claimed. □
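
As a rough illustration of the selection rule analyzed above, the following Python sketch runs a pairwise comparison among N classifiers and evaluates the failure bound of Lemma 4.10. It is only a simplified sketch under stated assumptions: ActiveSelect itself is not restated in this section, so the names pairwise_select, label_oracle, and pool are hypothetical, the pool is scanned in a fixed order, and the only property taken from the proof is that the output index minimizes max_j m_kj, where m_kj counts the mistakes of h_k on labeled points where h_k and h_j disagree.

```python
import math
from itertools import combinations


def failure_bound(n_classifiers: int, m: int) -> float:
    """Evaluate 2(N-1)exp{-(m / C(N,2)) / 72}, the failure bound of Lemma 4.10."""
    if n_classifiers < 2:
        return 0.0  # a single classifier is trivially the best one
    pairs = math.comb(n_classifiers, 2)
    return 2 * (n_classifiers - 1) * math.exp(-(m / pairs) / 72)


def pairwise_select(classifiers, pool, label_oracle, m):
    """Return (index minimizing max_j m_kj, number of label requests used).

    Hypothetical simplification: for each pair (j, k), at most m // C(N,2)
    labels are requested, all on pool points where h_j and h_k disagree.
    """
    n = len(classifiers)
    if n < 2:
        return 0, 0
    budget = m // math.comb(n, 2)            # label requests allowed per pair
    mistakes = [[0] * n for _ in range(n)]   # mistakes[k][j] plays the role of m_kj
    requests = 0
    for j, k in combinations(range(n), 2):
        disagreements = [x for x in pool if classifiers[j](x) != classifiers[k](x)]
        for x in disagreements[:budget]:
            y = label_oracle(x)
            requests += 1
            # With binary labels, exactly one of the two classifiers errs here.
            if classifiers[j](x) != y:
                mistakes[j][k] += 1
            else:
                mistakes[k][j] += 1
    best = min(range(n), key=lambda k: max(mistakes[k]))
    return best, requests
```

For example, with N = 3 and m = 300 the bound evaluates to 4exp{-100/72}, which is close to 1 and hence uninformative; the guarantee only becomes meaningful once m divided by the number of pairs is large relative to 72.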


Lemma 4.11. There is an event $H_n$, holding with probability $\geq 1 - \exp\{-\sqrt{n}\}$, such that for
some $\mathbb{C}$-dependent function $\phi(n) = o(1)$, $V \subseteq \mathbb{C}(\phi(n); \mathcal{D}_{XY})$.

Proof. By the uniform convergence bounds proven by Vapnik [1982], for a $\mathbb{C}$-dependent finite
constant $c$, with probability $\geq 1 - \exp\{-n^{1/2}\}$, $V \subseteq \mathbb{C}(cn^{-1/4} + \tau; \mathcal{D}_{XY})$. Thus, the result
holds for $\phi(n) = cn^{-1/4} + \tau = o(1)$. □


Lemma  4.12.  If  r  >  —  +  7 <i  tken  fkere 

is a strictly positive function $\phi'(n) = o(1)$ such that, with probability $\geq 1 - 1/n$,
$\mathbb{C}(\phi'(n); \mathcal{D}_{XY}) \subseteq V$.

Proof. By the uniform convergence bounds proven by Vapnik [1982], with probability $1 - 1/n$,
every $h \in \mathbb{C}$ has $|er(h) - er_{\lfloor n/3 \rfloor}(h)| < \tau/3$. Therefore, on this event, $V \supseteq \mathbb{C}(\tau/3; \mathcal{D}_{XY})$. Thus,
we can let $\phi'(n) = \tau/3$, which satisfies the desired conditions. □
Lemma 4.13. For any $n \in \mathbb{N}$, there is an event $H'_n$ for the data sequence $Z_{\lfloor n/3 \rfloor}$ with


nK)  > 


i, 


ifVxY  G  Realizable^ C) 


1  —  1/n,  ifT>xY  ^  Realizable^ C)  but  r  >  —  +  7\ '  —  -  —  — 


s.t.  on  H'n,  for  any  k  G  {1,  2, . . . ,  d  +  1}  with  P(S'  G  Xk  1  :  lim  l[C(r)  shatters  5]  =  1)  >  0, 

r\ 0 


IP  (S'  G  Xk  :  V  shatters  S|  lim  l[C(r)  shatters  S]  =  1) 

r\0 


=  P(S  G  Xk  :  lim  1  [V(r)  shatters  S]  =  1|  lim  l[C(r)  shatters  S]  =  1)  =  1. 


r\0 


r\0 


Proof.  For  the  case  of  T>xy  ^  Realizable^ C)  and  r  >  ^  +  7\J ^-d-,  the  result  imme¬ 
diately  follows  from  Lemma  14.121  which  implies  that  on  an  event  of  probability  >  1  —  1/n,  for 
any  set  S,  t[V  shatters  S]  >  lim  t[V(r)  shatters  S]  =  lim  l[C(r)  shatters  S]. 

r\ 0  r\0 

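Next we examine the case where $\mathcal{D}_{XY} \in \text{Realizable}(\mathbb{C})$. We will show this is true for any
fixed $k$, and the existence of $H'_n$ then holds by the union bound. Fix any set $S \in \mathcal{X}^{k-1}$ s.t.
$\lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 1$. Suppose $V(r)$ does not shatter $S$ for some $r > 0$. Then there is an
infinite sequence of sets $\{\{h_1^{(i)}, h_2^{(i)}, \ldots, h_{2^{k-1}}^{(i)}\}\}_i$ with $\forall j \leq 2^{k-1}$, $\mathbb{P}(x : h_j^{(i)}(x) \neq h^*(x)) \downarrow 0$,
such that each $\{h_1^{(i)}, \ldots, h_{2^{k-1}}^{(i)}\} \subseteq \mathbb{C}(r)$ and shatters $S$. Since $V(r)$ does not shatter $S$,

\[
1 = \inf_i \mathbb{1}[\exists j : h_j^{(i)} \notin V(r)] = \inf_i \mathbb{1}[\exists j : h_j^{(i)}(Z_{\lfloor n/3 \rfloor}) \neq h^*(Z_{\lfloor n/3 \rfloor})].
\]

But

\[
\mathbb{E}\Big[\inf_i \mathbb{1}[\exists j : h_j^{(i)}(Z_{\lfloor n/3 \rfloor}) \neq h^*(Z_{\lfloor n/3 \rfloor})]\Big]
\leq \inf_i \mathbb{E}\big[\mathbb{1}[\exists j : h_j^{(i)}(Z_{\lfloor n/3 \rfloor}) \neq h^*(Z_{\lfloor n/3 \rfloor})]\big]
\leq \lim_{i \to \infty} \sum_{j \leq 2^{k-1}} \lfloor n/3 \rfloor\, \mathbb{P}(x : h_j^{(i)}(x) \neq h^*(x)) = 0,
\]

where the second inequality follows from the union bound. Therefore, $\forall r > 0$,
$\mathbb{P}(Z_{\lfloor n/3 \rfloor} \in \mathcal{X}^{\lfloor n/3 \rfloor} : V(r) \text{ does not shatter } S) = 0$ by Markov's inequality. Furthermore, since
$\mathbb{1}[V(r) \text{ does not shatter } S]$ is monotonic in $r$, Markov's inequality and the monotone convergence
theorem give us that

\[
\mathbb{P}\big(Z_{\lfloor n/3 \rfloor} \in \mathcal{X}^{\lfloor n/3 \rfloor} : \lim_{r \to 0} \mathbb{1}[V(r) \text{ does not shatter } S] = 1\big)
\leq \mathbb{E}\big[\lim_{r \to 0} \mathbb{1}[V(r) \text{ does not shatter } S]\big]
= \lim_{r \to 0} \mathbb{P}\big(Z_{\lfloor n/3 \rfloor} \in \mathcal{X}^{\lfloor n/3 \rfloor} : V(r) \text{ does not shatter } S\big) = 0.
\]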

r\0  r\ 0 

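This implies that

\[
\begin{aligned}
&\mathbb{P}\big(Z_{\lfloor n/3 \rfloor} \in \mathcal{X}^{\lfloor n/3 \rfloor} : \mathbb{P}(S \in \mathcal{X}^{k-1} : \lim_{r \to 0} \mathbb{1}[V(r) \text{ shatters } S] = 0 \mid \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 1) > 0\big) \\
&= \lim_{\xi \to 0} \mathbb{P}\big(Z_{\lfloor n/3 \rfloor} : \mathbb{P}(S \in \mathcal{X}^{k-1} : \lim_{r \to 0} \mathbb{1}[V(r) \text{ shatters } S] = 0 \mid \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 1) > \xi\big) \\
&\leq \lim_{\xi \to 0} \mathbb{P}\big(Z_{\lfloor n/3 \rfloor} : \mathbb{P}(S \in \mathcal{X}^{k-1} : \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 1 \neq \lim_{r \to 0} \mathbb{1}[V(r) \text{ shatters } S]) > \xi\big) \\
&\leq \lim_{\xi \to 0} \frac{1}{\xi}\, \mathbb{E}\big[\mathbb{P}(S \in \mathcal{X}^{k-1} : \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 1 \neq \lim_{r \to 0} \mathbb{1}[V(r) \text{ shatters } S])\big] \quad \text{(by Markov's inequality)} \\
&= \lim_{\xi \to 0} \frac{1}{\xi}\, \mathbb{E}\big[\mathbb{1}[\lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 1]\, \mathbb{P}(Z_{\lfloor n/3 \rfloor} : \lim_{r \to 0} \mathbb{1}[V(r) \text{ shatters } S] = 0)\big] \quad \text{(by Fubini's theorem)} \\
&= \lim_{\xi \to 0} 0 = 0.
\end{aligned}
\]

□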


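Lemma 4.14. Suppose $k \in \mathbb{N}$ satisfies $\mathbb{P}(S \in \mathcal{X}^{k-1} : \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 1) > 0$. There is
a function $q(n) = o(1)$ such that, for any $n \in \mathbb{N}$, on event $H_n \cap H'_n$ (defined above),

\[
\mathbb{P}(S \in \mathcal{X}^{k-1} : \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 0 \mid V \text{ shatters } S) \leq q(n).
\]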


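Proof. By Lemmas 4.11 and 4.13, we know that on event $H_n \cap H'_n$,

\[
\begin{aligned}
\mathbb{P}(S \in \mathcal{X}^{k-1} : \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 0 \mid V \text{ shatters } S)
&= \frac{\mathbb{P}(S \in \mathcal{X}^{k-1} : \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 0 \text{ and } V \text{ shatters } S)}{\mathbb{P}(S \in \mathcal{X}^{k-1} : V \text{ shatters } S)} \\
&\leq \frac{\mathbb{P}(S \in \mathcal{X}^{k-1} : \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 0 \text{ and } V \text{ shatters } S)}{\mathbb{P}(S \in \mathcal{X}^{k-1} : \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 1)} \\
&\leq \frac{\mathbb{P}(S \in \mathcal{X}^{k-1} : \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 0 \text{ and } \mathbb{C}(\phi(n)) \text{ shatters } S)}{\mathbb{P}(S \in \mathcal{X}^{k-1} : \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 1)}.
\end{aligned}
\]

Define $q(n)$ as this latter quantity. Since
$\mathbb{P}(S \in \mathcal{X}^{k-1} : \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 0 \text{ and } \mathbb{C}(r') \text{ shatters } S)$ is monotonic in $r'$,

\[
\begin{aligned}
\lim_{n \to \infty} q(n)
&= \lim_{r' \to 0} \frac{\mathbb{P}(S \in \mathcal{X}^{k-1} : \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 0 \text{ and } \mathbb{C}(r') \text{ shatters } S)}{\mathbb{P}(S \in \mathcal{X}^{k-1} : \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 1)} \\
&= \frac{\mathbb{E}\big[\mathbb{1}[\lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 0]\, \lim_{r' \to 0} \mathbb{1}[\mathbb{C}(r') \text{ shatters } S]\big]}{\mathbb{P}(S \in \mathcal{X}^{k-1} : \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 1)} = 0,
\end{aligned}
\]

where the second equality holds by the monotone convergence theorem. This proves
$q(n) = o(1)$, as claimed. □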

Lemma  4.15.  Let  k*  G  Rf  be  the  smallest  index  k  for  which 
P(5  G  Xk~l  :  lim  l[C(r)  shatters  5]  =  1)  >  0  and 

r\ 0 

P(S'  G  Xk~l  :  P(x  :  lim  l[C(r)  shatters  S  U  {x}]  =  1)  =  0|  lim  l[C(r)  shatters  S]  —  1)  >  7. 

r\0  r\0 

Such  a  k*  <  d  +  1  exists,  and\/(  G  (0, 1),  3 s.t.  Vn  >  n^,  ifVxy  G  ‘Realizable^ C)  or 
r  >  ^  +  7 \J ■ U'  —  and  T>xy  G  rBenignNoise(yC),  on  event  Hn  fl  H'n  (defined  above), 
V/c  <  k*, 

P(x  :  r](x)  ^1/2  and¥(S  EX^1  :V(Xth*(x))  does  not  shatter  S\V  shatters  S)  >  Q)  — 

P(x  :  r}(x)  1/2  and  P(SI  G  Xk~l :  Vix  h*(x))  does  not  shatter  S')  lim  t[V(r)  shatters  S]  =  1)  >C) 

r\0 

=  0. 

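Proof. First we prove that such a $k^*$ is guaranteed to exist. As mentioned, by convention any
set of classifiers shatters $\{\}$, and $\{\} \in \mathcal{X}^0$, so there exist values of $k$ for which
$\mathbb{P}(S \in \mathcal{X}^{k-1} : \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 1) > 0$. Furthermore, we will see that for any
$k \in \{1, \ldots, d+1\}$, if this condition is satisfied for $k$, but

\[
\mathbb{P}\big(S \in \mathcal{X}^{k-1} : \mathbb{P}(x : \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S \cup \{x\}] = 1) = 0 \mid \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 1\big) \leq \gamma,
\]

then $\mathbb{P}(S \in \mathcal{X}^{k} : \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 1) > 0$. We prove this by contradiction. Suppose the
implication is not true for some $k$. Then

\[
\begin{aligned}
0 < 1 - \gamma
&\leq \mathbb{P}\big(S \in \mathcal{X}^{k-1} : \mathbb{P}(x : \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S \cup \{x\}] = 1) > 0 \mid \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 1\big) \\
&\leq \lim_{\xi \to 0} \frac{\mathbb{P}\big(S \in \mathcal{X}^{k-1} : \mathbb{P}(x : \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S \cup \{x\}] = 1) > \xi\big)}{\mathbb{P}(S \in \mathcal{X}^{k-1} : \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 1)} \\
&\leq \lim_{\xi \to 0} \frac{\mathbb{E}\big[\mathbb{P}(x : \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S \cup \{x\}] = 1)\big]}{\xi\, \mathbb{P}(S \in \mathcal{X}^{k-1} : \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 1)} \quad \text{(by Markov's inequality)} \\
&= \lim_{\xi \to 0} \frac{\mathbb{P}(S \in \mathcal{X}^{k} : \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 1)}{\xi\, \mathbb{P}(S \in \mathcal{X}^{k-1} : \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 1)} = \lim_{\xi \to 0} 0 = 0.
\end{aligned}
\]

This is a contradiction, so it must be true that the implication holds for all $k$. This establishes the
existence of $k^*$, since we definitely have

\[
\mathbb{P}\big(S \in \mathcal{X}^{d} : \lim_{r \to 0} \mathbb{P}(x : \mathbb{C}(r) \text{ shatters } S \cup \{x\}) = 0 \mid \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 1\big) = 1 > \gamma,
\]

so that some $k$ satisfies both conditions.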

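Next we prove the second claim. Take $k \leq k^*$. Let $n_\zeta$ be s.t. $\sup_{n > n_\zeta} q(n) < \zeta$; it must exist
since $q(n) = o(1)$. By Lemma 4.14, for $n > n_\zeta$, on $H_n \cap H'_n$,

\[
\begin{aligned}
&\mathbb{P}\big(x : \eta(x) \neq 1/2 \text{ and } \mathbb{P}(S \in \mathcal{X}^{k-1} : V_{(x,h^*(x))} \text{ does not shatter } S \mid V \text{ shatters } S) > \zeta\big) \\
&\leq \mathbb{P}\big(x : \eta(x) \neq 1/2 \text{ and } \mathbb{P}(S \in \mathcal{X}^{k-1} : V_{(x,h^*(x))} \text{ does not shatter } S \mid \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 1) + q(n) > \zeta\big) \\
&\leq \frac{1}{\zeta - q(n)}\, \mathbb{E}\big[\mathbb{1}[\eta(x) \neq 1/2]\, \mathbb{P}(S \in \mathcal{X}^{k-1} : V_{(x,h^*(x))} \text{ does not shatter } S \mid \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 1)\big] \quad \text{(by Markov's inequality)} \\
&\leq \frac{\mathbb{E}\big[\mathbb{1}[\lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 1]\, \mathbb{P}(x : \eta(x) \neq 1/2 \text{ and } V_{(x,h^*(x))} \text{ does not shatter } S)\big]}{(\zeta - q(n))\, \mathbb{P}(S \in \mathcal{X}^{k-1} : \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 1)} \quad \text{(by Fubini's theorem)} \\
&\leq \frac{\mathbb{E}\big[\mathbb{1}[\lim_{r \to 0} \mathbb{1}[V(r) \text{ shatters } S] = 1]\, \mathbb{P}(x : \eta(x) \neq 1/2 \text{ and } V_{(x,h^*(x))} \text{ does not shatter } S)\big]}{(\zeta - q(n))\, \mathbb{P}(S \in \mathcal{X}^{k-1} : \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 1)} \quad \text{(by Lemma 4.13).} \qquad (4.4)
\end{aligned}
\]

For any set $S \in \mathcal{X}^{k-1}$ for which $\lim_{r \to 0} \mathbb{1}[V(r) \text{ shatters } S] = 1$, there is an infinite sequence of sets
$\{\{h_1^{(i)}, h_2^{(i)}, \ldots, h_{2^{k-1}}^{(i)}\}\}_i$ with $\forall j \leq 2^{k-1}$, $\mathbb{P}(x : \eta(x) \neq 1/2 \text{ and } h_j^{(i)}(x) \neq h^*(x)) \downarrow 0$, such that
each $\{h_1^{(i)}, \ldots, h_{2^{k-1}}^{(i)}\} \subseteq V$ and shatters $S$. If $V_{(x,h^*(x))}$ does not shatter $S$, then

\[
1 = \inf_i \mathbb{1}[\exists j : h_j^{(i)} \notin V_{(x,h^*(x))}] = \inf_i \mathbb{1}[\exists j : h_j^{(i)}(x) \neq h^*(x)].
\]

In particular, by Markov's inequality,

\[
\begin{aligned}
\mathbb{P}(x : \eta(x) \neq 1/2 \text{ and } V_{(x,h^*(x))} \text{ does not shatter } S)
&\leq \mathbb{P}\big(x : \eta(x) \neq 1/2 \text{ and } \inf_i \mathbb{1}[\exists j : h_j^{(i)}(x) \neq h^*(x)] = 1\big) \\
&\leq \mathbb{E}\big[\mathbb{1}[\eta(X) \neq 1/2]\, \inf_i \mathbb{1}[\exists j : h_j^{(i)}(X) \neq h^*(X)]\big] \\
&\leq \inf_i \mathbb{P}\big(x : \eta(x) \neq 1/2 \text{ and } \exists j \text{ s.t. } h_j^{(i)}(x) \neq h^*(x)\big) \\
&\leq \lim_{i \to \infty} \sum_{j \leq 2^{k-1}} \mathbb{P}\big(x : \eta(x) \neq 1/2 \text{ and } h_j^{(i)}(x) \neq h^*(x)\big) = 0.
\end{aligned}
\]

This means (4.4) equals 0. □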

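Lemma 4.16. Suppose $k \in \{1, 2, \ldots, d+1\}$ satisfies
$\mathbb{P}(S \in \mathcal{X}^{k-1} : \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 1) > 0$ and

\[
\alpha_k = \mathbb{P}\big(S \in \mathcal{X}^{k-1} : \lim_{r \to 0} \mathbb{P}(x : \mathbb{C}(r) \text{ shatters } S \cup \{x\}) = 0 \mid \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 1\big) > \gamma.
\]

Then there is a function $\Delta_n^{(k)} = o(1)$ such that, on event $H_n \cap H'_n$ (defined above),

\[
\mathbb{P}\big(x : \mathbb{P}(S \in \mathcal{X}^{k-1} : V \text{ shatters } S \cup \{x\} \mid V \text{ shatters } S) \geq 1 - (\gamma + \alpha_k)/2\big) \leq \Delta_n^{(k)}.
\]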

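Proof. Let

\[
A = \{S \in \mathcal{X}^{k-1} : \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 1 \text{ and } \lim_{r \to 0} \mathbb{P}(x : \mathbb{C}(r) \text{ shatters } S \cup \{x\}) = 0\}.
\]

Then, letting $\phi(n)$ be as in Lemma 4.11, on event $H_n \cap H'_n$,

\[
\begin{aligned}
&\mathbb{P}\big(x : \mathbb{P}(S \in \mathcal{X}^{k-1} : V \text{ shatters } S \cup \{x\} \mid V \text{ shatters } S) \geq 1 - (\gamma + \alpha_k)/2\big) \\
&\leq \mathbb{P}\big(x : \mathbb{P}(S \in \mathcal{X}^{k-1} : \mathbb{C}(\phi(n)) \text{ shatters } S \cup \{x\} \mid \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 1) \\
&\qquad\quad + \mathbb{P}(S \in \mathcal{X}^{k-1} : \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 0 \mid V \text{ shatters } S) \geq 1 - (\gamma + \alpha_k)/2\big). \qquad (4.5)
\end{aligned}
\]

By Lemma 4.13 we know there is some finite $\tilde{n}_1$ s.t. any $n \geq \tilde{n}_1$ has (on event $H_n \cap H'_n$)

\[
\mathbb{P}(S \in \mathcal{X}^{k-1} : \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 0 \mid V \text{ shatters } S) < (\alpha_k - \gamma)/3.
\]

We therefore have that, for $n \geq \tilde{n}_1$, on event $H_n \cap H'_n$, (4.5) is at most

\[
\begin{aligned}
&\mathbb{P}\big(x : \mathbb{P}(S \in \mathcal{X}^{k-1} : \mathbb{C}(\phi(n)) \text{ shatters } S \cup \{x\} \mid \lim_{r \to 0} \mathbb{1}[\mathbb{C}(r) \text{ shatters } S] = 1) + (\alpha_k - \gamma)/3 \geq 1 - (\gamma + \alpha_k)/2\big) \\
&\leq \mathbb{P}\big(x : \mathbb{P}(S \in \mathcal{X}^{k-1} : \mathbb{C}(\phi(n)) \text{ shatters } S \cup \{x\} \mid S \in A)\,\alpha_k + (1 - \alpha_k) + (\alpha_k - \gamma)/3 \geq 1 - (\gamma + \alpha_k)/2\big) \\
&= \mathbb{P}\big(x : \mathbb{P}(S \in \mathcal{X}^{k-1} : \mathbb{C}(\phi(n)) \text{ shatters } S \cup \{x\} \mid S \in A) \geq (\alpha_k - \gamma)/(6\alpha_k)\big) \\
&\leq \frac{6\alpha_k}{\alpha_k - \gamma}\, \mathbb{E}\big[\mathbb{P}(S \in \mathcal{X}^{k-1} : \mathbb{C}(\phi(n)) \text{ shatters } S \cup \{X\} \mid S \in A)\big] \quad \text{(by Markov's inequality)} \\
&\leq \frac{6\alpha_k}{\alpha_k - \gamma}\, \mathbb{E}\big[\mathbb{P}(x : \mathbb{C}(\phi(n)) \text{ shatters } S \cup \{x\}) \mid S \in A\big] \quad \text{(by Fubini's theorem).}
\end{aligned}
\]

We will define $\Delta_n^{(k)}$ equal to this last quantity for any $n \geq \tilde{n}_1$ (we can take $\Delta_n^{(k)} = 1$ for
$n < \tilde{n}_1$). It remains only to show this quantity is $o(1)$. Since
$\frac{6\alpha_k}{\alpha_k - \gamma}\mathbb{E}[\mathbb{P}(x : \mathbb{C}(r) \text{ shatters } S \cup \{x\}) \mid S \in A]$ is monotonic in $r$,

\[
\lim_{n \to \infty} \Delta_n^{(k)} = \lim_{r \to 0} \frac{6\alpha_k}{\alpha_k - \gamma}\, \mathbb{E}\big[\mathbb{P}(x : \mathbb{C}(r) \text{ shatters } S \cup \{x\}) \mid S \in A\big].
\]

Since for any $S \in \mathcal{X}^{k-1}$, $\mathbb{P}(x : \mathbb{C}(r) \text{ shatters } S \cup \{x\})$ is monotonic in $r$, the monotone
convergence theorem implies

\[
\lim_{r \to 0} \frac{6\alpha_k}{\alpha_k - \gamma}\, \mathbb{E}\big[\mathbb{P}(x : \mathbb{C}(r) \text{ shatters } S \cup \{x\}) \mid S \in A\big]
= \frac{6\alpha_k}{\alpha_k - \gamma}\, \mathbb{E}\big[\lim_{r \to 0} \mathbb{P}(x : \mathbb{C}(r) \text{ shatters } S \cup \{x\}) \mid S \in A\big] = 0.
\]

□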
Lemma  4.17.  Vn  G  N,  there  is  an  event  //„  C  Hn  fl  H'n  on  Z  that,  if 
T>xy  G  rBenignNoise(C),  has 

P( Hn )  >  1  —  cn4/3  •  exp{— c'n1//3}  —  1  [Dxy  ^  Olealizable^Cfn-1,  for  T>xy-  and 
C-dependent  constants  c,  d  G  (0,  oo),  such  that 

Vn  G  N,  on  Hn , \{x  G  £k*  :  A^\x,U2)  >  1  -  7}|  <  [n/( 3  •  2fc’)J,  (4.6) 

3AA  1  =  o(l)  and  An  ^  =  o(l)  s.t.  Vn  G  N,  on  Hn, 

A {k*\U2)  <  A P  and  A(fc‘}(Wi,  U2)  <  A?*>,  (4.7) 

w/zere  V/c.  A ^(U2)  =  P(x  :  A(fc)(x,  W2)  >  1  —  7);  also  3n*  G  N  57.  Vn  >  n*,  if 
T>xy  £  ‘Jlealizable( C),  ozz  Hn,dx  G  £&*, 

A<**>(z,  W2)  <  1  -  7  =*  f  W2)  <  f(fc*)(x,  h*(x),U2):  (4.8) 


where  Ck*  is  as  in  Meta-Algorithm  5;  also,  Vn  >  n* ,  ifV\Y  G  'BenignN oise( C)  and 


T  >  15  _|_  H.  /  ln(4r;)+n  In  — 


,  then  on  IIn 


[x  :  77(0:)  7^  1/2  azzd  3fc  <  k*  s.t.  A^k\x,  U2)  <  1  —  7  and 

t^k\x,h*(x),U2)  <  f  (k\x,-h*(x),U2))  <  (d+  l)e~c"nl/\  (4.9) 


for  a  C-  and  T>xy -dependent  finite  constant  c"  >  0. 


Proof.  Since  most  of  this  lemma  discusses  only  k  =  k*,  in  the  proof  I  will  simplify  the  notation 
by  dropping  ( k *)  superscripts,  so  that  A(U\  ,U2)  abbreviates  A(/,'"VZ7|  .U2),  T(x,  y,  U2)  abbrevi¬ 
ates  r(k*\x,y,U2),  and  so  on.  I  do  this  only  for  k*,  and  will  include  the  superscripts  for  any 
other  value  of  k  so  that  there  is  no  ambiguity. 

We  begin  with  (l4~6l.  Recall  that  Ck*  is  initially  an  independent  sample  of  size  \nf ( 6  ■ 
2k*A(Ui,  U2))\  sampled  from  VXy[X]  (i.e.,  before  we  add  labels  to  the  examples).  Let  A (U2)  = 
P(x  :  A(x,U2)  >1  —  7). 




By  Hoeffding’s  inequality,  on  an  event  ! (U2)  on  U.\  with  P(Zi,  :  HpP (U2))  >  1  —  2  • 
exp{— 2ml/3}  >  1  —  2  •  exp{— 2n1^}, 

|A(W2)  -  d-  E  UpfeWJ  >  1  -  7] I  <  ^73 . 

zGWl  mn 

and  therefore 

A (u2)  <  A (UM). 

By  a  Chemoff  bound,  there  is  an  event  HyP  (U2 )  on  Ck*  and  U\  with 

P(£fc*,Wi:f/(2)(W2))>l-exp{-Ln/(6-2fc*A(W2))jA(W2)/3}  >  l-ea;p{-(n-6-2fc‘)/(18-2fc‘)} 

such  that,  on  an  event  Up 1  (U2)  D  HpP  (U2), 

| {x  G  Ck *  :  A(x,  U2)  >  1  -  7} |  <  2[n/(6  ■  2fc*  A(W2))J  A(W2)  <  n/(3  •  2fc‘). 

Since  the  left  side  of  m  is  an  integer,  (l4~6l)  is  established. 

Next  we  prove  O-  If  k*  =  1,  the  result  clearly  holds.  In  particular,  we  have  A ^(W2)  = 
P(DIS(V)),  and  Hoeffding’s  inequality  implies  that  on  an  event  with  probability 
1  —  exp{— 2ml/3},  AW(Wi,W2)  <  P(DIS(V))  +  2 Combined  with  Lemma  14.161  we 
have  bounds  of  A^  +  2 m^3  =  o(l). 

Otherwise,  we  have  k*  >  2.  In  this  case,  by  Hoeffding’s  inequality  and  a  union  bound  (over 
k  values),  for  an  event  H"  over  U2,  with  P (if")  >  1  —  (d  +  l)exp{— 2 [mn/(k*  —  1) J 1,/3},  on 
H'[  fl  H'n,  for  all  k  G  (2, . . . ,  k*}  (by  Lemma l4~l3l) 

Mk  >  ¥(S  G  Xk~x  :  lim  l[C(r)  shatters  S]  =  l)\mn/{k  —  1)J  —  [mn/(k  —  1)J2^3. 

r\ 0 

Let  us  name  the  right  side  of  this  inequality  mn{n).  Recall  that  for  k  <  k*, 

P (S  G  Xk~x  :  lim  l[C(r)  shatters  S]  =  1)  >  0 

r\ 0 

by  definition  of  k*,  so  m(n )  diverges.  On  event  Hp\lA2), 

A(W, . IA2)  <  A(ZY2)  H - yr  —  A(ZY2)  H — Y/3-  (4-10) 

Win  U 
Thus,  it  suffices  to  bound  A{U2)  by  a  o(l)  function.  In  fact,  since  we  have  Mk*  lower  bounded 
by  a  diverging  function  on  H"  fl  //'  ,  so  for  sufficiently  large  n,  on  H'n  fl  II", 

A (W2)  <  P(x  :  A (x,U2)  -  M"1/3  >  1  -  (27  +  a)/3). 

Thus,  it  suffices  to  bound  P(x  :  A (x,U2)  —  M k}^3  >  1  —  (27  +  a)/3)  by  a  o(l)  function.  On 

event  Hn  fl  H'n  fl  //",  we  have  that 

P(x  :  A (x,U2)  -  M“1/3  >  1  -  (27  +  a)/ 3) 

<  P(x  :  P(S  G  Xk*1  :  V  shatters  S  U  {x}|y  shatters  S)  >  1  —  (7  +  a)/2)  + 

\m/ (fc*  —  1)J 

P(a;:  |P(S'eT’fc*_1 :  V shatters  S’U{a:}|Vr  shatters  S)  —  1[V shatters  S) U{x}]|  >  (a-jy)/6) 

i=  1 

Bv  Lemma l4T6l  on  event  H„  nH’n, 

P(x  :  P (S  G  :  V  shatters  S  U  {a;} |  V  shatters  5)  >  1  —  (7  +  a)/ 2)  <  =  o(l). 

Thus,  it  suffices  to  prove  the  existence  of  a  o(l)  bound  on 

P(x:  |P(S'GT’fc*”1 :  V shatters  S'U{x}|VA  shatters  S)  —  -Aj  ^-W shatters  S) U{x}]|  >  (a-jy)/6) 

2=1 

For  this,  we  proceed  as  follows.  Define  px  =  -A--  1  1[V  shatters  St  U  {.x}],  a  random 

variable  depending  on  U2,  and  px  =  F(S  G  Xk*~l  :  V  shatters  S  U  {x}  \  V  shatters  S). 


P (U2  :  Mk*  >  m(n )  and  P(x  :  \px  -  px\  >  (a  -  7)/6)  >  Mk}/3) 

<  P  \U2\  Mk*  >  m(n)  and  — — — E[|p\-  —  px |]  >  MPA3  ]  (by  Markov’s  inequality) 

V  « -  7  / 

=  ^2  P {U2  :  Mk*  =  m) P  (W2  :  E[| px  -  px\]  >  m~1/3(a  -  7)/6| Mk*  =  m ) 

m=m(n) 

<  sup  P  {U2  ■  exp{tmmE[\px  -  Px\ ]}  >  exp{tmm2/3(a  -  7)/6}|Mfc*  =  m )  , 

m>m{n) 

for any values $t_m > 0$. We now proceed as in Chernoff's bounding technique. By Markov's
inequality, this last quantity is at most


sup  E[etmmE^px  Px^\Mk*  =  m}exp{  — tmm2^(a  —  7)/6} 

m>m(n) 

<  sup  K\K[etmm^px~px^]\Mk*  =  m\exp{—tmrn2^(a.  —  7)/6}  (by  Jensen  and  Fubini) 

m>m(n) 

<  sup  (sup  +  sup  K[etmmp~trnBm’p])exp{  —  tmm2^(a  —  7)/6} 

m>m(n)  p£[0,l]  p£[0,l] 

where  BmjP  ~  Binomial  (m,  p),  and  the  expectation  is  now  over  By  symmetry,  if  p  is 

the  maximizer  of  the  first  expectation,  then  1  —  p  maximizes  the  second  expectation,  and  the 
maximizing  values  are  identical,  so  this  is  at  most 

2  sup  sup  E[exp{tmBm>p  -  tmmp}\exp{-tmrn2/3(a  -  7)/6)}. 

m>m(n)  pE[0,l] 

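Following the usual proof for Hoeffding's inequality [see e.g., Devroye et al., 1996], this is at
most

\[
2 \sup_{m \geq m(n)} \exp\{t_m^2 m/8\}\exp\{-t_m m^{2/3}(\alpha - \gamma)/6\}.
\]

Taking $t_m = m^{-1/3}\,2(\alpha - \gamma)/3$, this is

\[
2 \sup_{m \geq m(n)} \exp\{m^{1/3}(\alpha - \gamma)^2/18 - m^{1/3}\,2(\alpha - \gamma)^2/18\}
= 2 \sup_{m \geq m(n)} \exp\{-m^{1/3}(\alpha - \gamma)^2/18\} = 2\exp\{-m(n)^{1/3}(\alpha - \gamma)^2/18\}.
\]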
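
The exponent obtained from this choice of $t_m$ can be checked by direct substitution. The short worked computation below introduces nothing new; it only expands the two terms of the exponent, writing $\alpha - \gamma$ for the gap that appears in the bound above.

```latex
\begin{align*}
\frac{t_m^2\, m}{8}
  &= \frac{m}{8}\cdot\frac{4(\alpha-\gamma)^2}{9\, m^{2/3}}
   = \frac{m^{1/3}(\alpha-\gamma)^2}{18}, \\
t_m\, m^{2/3}\,\frac{\alpha-\gamma}{6}
  &= \frac{2(\alpha-\gamma)}{3\, m^{1/3}}\cdot m^{2/3}\cdot\frac{\alpha-\gamma}{6}
   = \frac{2\, m^{1/3}(\alpha-\gamma)^2}{18}, \\
\frac{t_m^2\, m}{8} - t_m\, m^{2/3}\,\frac{\alpha-\gamma}{6}
  &= -\frac{m^{1/3}(\alpha-\gamma)^2}{18}.
\end{align*}
```

Since the resulting exponent is decreasing in $m$, the supremum over $m \geq m(n)$ is attained at $m = m(n)$.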

Therefore,  there  is  an  event  H"'  on  U2  with 

P (H"')  >  1  —  2 exp{—m{n)1^(a  —  7)2/ 18}  >  1— 

2exp{  —  (P(S'GT’fc*_1 :  lim  l[C(r) shatters S]  =  1)  \nj (k*—  1)J  —  \nj (/c* — 1)J 2//3)1//3(cv — 7)2/18}, 

r\0 

such  that  on  //"'  n  //"  fl  H'n, 

P(x:  |P(S'GT’fc,'_1 :  V  shatters  S’U{a:}|VA  shatters  S')  — 777  Y2  1|X  shatters  Si  U{x}]|  >  (a-^ 7)  /  6) 

i=  1 

<  M-i/3  <  m(n)_1//3  =  o(l). 

Finally,  we  turn  to  (14.81)  and  (14.91).  If  k  —  1,  then  for  T>xy  €  Oiealizable( C),  we  clearly  have 
h*  e  V;  otherwise,  if  D\y  £  'BenignN oise(C)  md  t  >  ^  +  7-yA';  ''7'  then  Lemmal4.12l 


implies  that,  on  an  event  over  Z\_n/z j  of  probability  1  —  1/n,  with  probability  1  over  x  such  that 

rj(x)  7^  1/2,  if  fW(x,  y,U2)  >  f^x,  —  y,W2),  then  y  =  h*(x).  This  implies  (14.81)  for  k*  —  1 
and  it  covers  the  k  —  1  case  for  (14.9b. 

Let  us  now  focus  on  k  >  2  for  (I4.9I).  and  in  particular  k*  >  2  for  both  (14.91)  and  (14.81).  By 
Lemmal4.151  for  any  x  in  a  set  of  probability  1,  Hoeffding’s  inequality  and  a  union  bound  (over 
k  values)  implies  there  is  an  event  H™(x)  with  P(W2  :  H™(x))  >  1  —  (d  +  l)exp{— 2m(n)1^3} 
such  that,  for  n  >  n7/4,  on  the  additional  event  H™{x)  fl  Hn  fl  H'n  D  ///,  if  rj(x)  ^  1/2, 
\/k  e  (2 ,...,k*}. 


1 

M~k 


\mn/(k—l)\ 

E 


1=1 


11  \V(x:h~(x))  does  not  shatter  S(/'  !  and  V  shatters  S^] 


<  P(5  <E  Xk  1  :  V(Xjh*(x))  does  not  shatter  S\V  shatters  S)  +  Mk  1 ^ 

<  7/4  +  Mk1^  <  7/4  +  m(n)“1^3. 


For  sufficiently  large  n,  m(n)  3/3  <  7/4.  If  k  e  {2, ... ,  k*}  and  A ^k\x,U2)  <1  —  7,  then 


1 

A4 


Lmn/(fc— 1)J 

E 


i= 1 


1(V  does  not  shatter  sf  '1  U  {x}  and  V  shatters  5'f  !] 


>  7, 


and  thus,  if  this  happens  for  sufficiently  large  n  on  the  event  II™  (x)  fl  Hn  fl  H’n  fl  ///,  we  must 
have 
^fW(x, -/!*(!),  W2)  = 

1  \mn/(k-l)\ 

<— —  V  1[V(X:/,,*(X))  does  not  shatter  Sp  and  V  shatters  .Sp] 

Mk  — 

2=1 

<7/2  =  -7/2  +  7 

1  L"W(fc-i)J 

<-7/2  + vr  E  l[L  does  not  shatter  S)-fc)  U  {x}  and  V  shatters  S-  ^} 

Mk  i= 1 

1  Lmn/(fc-l)J 

=  —  7/2  +  — — -  ^[V(x,h*(x))  does  not  shatter  S-k)  and  V  shatters  S^] 

Mk 

2—1 

+  —  V  1  [V(Xih*(x))  shatters  Sf]  and  _h.(x))  does  not] 

Mk 

2—1 

<— —  ^  1  does  not  shatter  Sf:)  and  V  shatters  S^] 

Mk  i~t 

=f-t^(x,h"(x),UX 

By  a  union  bound  over  the  elements  of  Ck*, 

P(W2  :  P|  Hf(x))  >  1  —  nml/3(d  +  l)exp{— 2m(n)^3}, 

x£/lfc* 

which  suffices  to  prove  (Ol). 

Also,  we  have  the  following. 

P(W2  :  P(x  :  H™(x)  does  not  occur)  >  exp{—  m(n)1,/3}) 

<  ea:p{m(n)1//3}E[P(a:  :  Hlf{x)  does  not  occur)]  (by  Markov’s  inequality) 

=  exp{m(n)1^3}E[P(W2  :  Hf(X)  does  not  occur)]  (by  Fubini’s  theorem) 

<  exp{m(n)l/i} E[(d  +  l)exp{— 2m(n)1/3}]  =  (d  +  1  )exp{—m[n)1^}. 

This  suffices  to  prove  (14.91).  □ 

Proof of Theorem 4.3. The result now follows directly from Lemmas 4.17 and 4.10. (4.7) implies
$|\mathcal{L}_{k^*}| \geq L(n)$ for some function $L(n) = \omega(n)$, while (4.6) implies we will infer the labels
for all but at most $\lfloor n/(3 \cdot 2^{k^*}) \rfloor$ of them, and (4.8) implies that, for sufficiently large $n$, the
inferred labels are correct. Lemma 4.10 implies that $er(\hat{h})$ is at most twice the error of any of
the $d+1$ classifiers. These things happen on an event that only fails with probability at most
$\exp\{-c \cdot n^{1/x}\}$ for some $\mathcal{D}_{XY}$-dependent constant $c > 0$, and a universal constant $x > 0$.

Defining $L^{-1}(m) = \min\{n : L(n) \geq m\}$, we get that, for some distribution over
$\ell \in \{L(n), L(n)+1, \ldots\}$ (independent of the data),

\[
\mathbb{E}[er(\hat{h})] \leq \mathbb{E}_Z\big[\mathbb{E}_\ell[2er(A_p(Z_\ell))]\big] + \exp\{-c \cdot n^{1/x}\}
\leq \sup_{\ell \geq L(n)} \mathbb{E}_Z[2er(A_p(Z_\ell))] + \exp\{-c \cdot n^{1/x}\}.
\]

Therefore,

\[
\Lambda_a(3\epsilon, \mathcal{D}_{XY}) \leq L^{-1}(\Lambda_p(\epsilon, \mathcal{D}_{XY})) + c^{-x}\ln^x(1/\epsilon).
\]

If $\Lambda_p(\epsilon, \mathcal{D}_{XY}) \gg 1$, $L^{-1}(\Lambda_p(\epsilon, \mathcal{D}_{XY})) = o(\Lambda_p(\epsilon, \mathcal{D}_{XY}))$, so $\Lambda_p(\epsilon, \mathcal{D}_{XY}) \notin \text{Polylog}(1/\epsilon)$
implies the improvements claim, and otherwise $\Lambda_a(\epsilon, \mathcal{D}_{XY}) \in \text{Polylog}(1/\epsilon)$. □


Proof of Theorem 4.4. This follows identical reasoning to the proof of Theorem 4.3 except that
instead of adding $\exp\{-c \cdot n^{1/x}\}$ to the expected error, we simply take $\Lambda_a(2\epsilon, 2\delta, \mathcal{D}_{XY}) =
\max\{L^{-1}(\Lambda_p(\epsilon, \delta, \mathcal{D}_{XY})), c^{-x}\ln^x(1/\delta)\}$ to ensure the failure probability for the aforementioned
events is at most $\delta$. For $\Lambda_p(\epsilon, \delta, \mathcal{D}_{XY}) \gg 1$ this is effectively not a restriction at all for small $\epsilon$,
and otherwise we still have $\Lambda_a(\epsilon, 2\delta, \mathcal{D}_{XY}) = O(1)$. □

Lemma  4.18.  Let  h  be  the  classifier  returned  by  Meta-Algorithm  6,  when 
t  >  ^  +  7 \J  L' '  ['  !“  ,  and  T>xy  £  rBenignNoise(C).  Then  for  any  n  G  N,  there  is  some 

£n  =  o(n^1/2)  such  that,  on  an  event  H'n  C  Hn  with  P {H'f}  >  P(^n)  —  5/2, 

er(h)  —  v  <  £„. 


Proof.  For  brevity,  we  introduce  the  notation  Qk  =  {x  :  k'(x)  >  k},  where  as  before  k'(x)  = 
min { k'  :  ASk'\x,U2)  <  1  —7}. 


First note that, by Alexander's results on uniform convergence [Alexander, 1984, Devroye et al.,
1996], combined with a union bound, on an event $H''$ of probability $1 - \delta/2$, every $h \in \mathbb{C}$ has

\[
\forall k, \quad |er(h|Q_k) - er_{Q_k}(h)| \leq \sqrt{\frac{2048\, d \ln(1024\, d) + \ln(32(d+1)/\delta)}{|Q_k|}}.
\]

Define $\tilde{H}'_n = \tilde{H}_n \cap H''$, and for the remainder of the proof we assume this event holds. In
particular, this implies every $\hat{h}_k$ has

\[
er(\hat{h}_k|Q_k) \leq \inf_{h \in \mathbb{C}} er(h|Q_k) + 2\sqrt{\frac{2048\, d \ln(1024\, d) + \ln(32(d+1)/\delta)}{|Q_k|}}.
\]

Consider  any  k  <  k*.  We  have  (by  Lemma  14.171) 

er(hk)  =  P(Qfc)er(5fc|Qfc) 

+  P((x,  y)  :  x  £  Qk  and  rj{x)  =  1/2  and  hk(x)  ^  y ) 

+  P((x,  y)  :  x  Qk  and  r)(x)  ^  1/2  and  hk(x)  =  h*(x)  ^  y) 

+  P((x,  y)  :  x  Qk  and  rj(x)  ^  1/2  and  hk(x)  ^  h*(x)  =  y ) 

<  P(Qfc)  ^er(h*\Qk)  +  2 yj 2048d In(1024^+ln(32(d+l)/j);^ 

+  (l/2)P(x  :  x  Qk  and  rj(x)  =  1/2)  + 

P((x,  ?/)  :  x  Qk  and  rj(x)  ^  1/2  and  h*(x)  j -  y)  +  (d  +  l)e~c”nl,i 

<  P(Qfc)  ^er(5*|Qfc)  +  2^/204MIn(1024^+|n(32(d+1)/6)^ 

+  er(h* \X  \  Qfc)P(*  \  Qk)  +  (5  +  l)e~c"nl/3 


< 


Now  there  are  two  cases  to  consider.  In  the  first  case,  k*  <  k.  In  this  case,  we  have 
er{H)  ~  er(hk*) 


< 


< 


'k*)  yer{h~k |Qfe.)  -  er(hk*\Qk *] 

'**)  ^ erQk*Ch )  ~  erQk*  (hk*)  +  2i 

k*)7\ 


/2048dln(1024d)  +  ln(32(d  +  l)/5)\ 

|<5fc*|  y 


/2048dln(1024d)  +  ln(32(d  +  l)/8) 


\Qi 


Therefore, 


er(L)  -  z/  <  er(hk*)  -  v  + 


< 


< 


< 


07, 


/2048dln(1024d)  +  ln(32(d  +  l)/5) 


l>/(3  ■  2*)J 


P(Q,)9,  2048dln(1024d)  +  ln(32(d  +  1)/i)  +  (i  + 
'  '  ''  [2n/(3-2*)J 


Af)^)9./204M1“<1024d>  +  ln<32<d+1>^  +  (i  +  IK'" 

V  [2ri/ (3  ■  2fe)J  '  ’ 

Xrt-)„  /2048rfln(1024rf)  +  lii(32(rf  +  1)/<J) 

A"  9\/ - [2n/(3.2«)j - +  (d+1)e 


Since  A,0  1  =  o(l)  (by  definition  in  Lemma  14.171).  this  last  quantity  is  o(n  1/2). 


On  the  other  hand,  suppose  k  <  k*.  If  P(Q^)  =  0,  then  the  aforementioned  bound  on  excess 
error  implies  the  result.  Otherwise,  for  k  =  k  +  1,  3j  <  k  such  that 
r  /2048c? ln(1024d)  +  ln(32(d  +  1) /$) 

V  [2n/ (3  •  2k)\ 

<  erQj(hk)  -  erQj(hj) 

<  erC^IQ,)  -  erihM)  + 

=  P((a,  ?/)  :  /ifc(x)  7^  y  and  77(a)  ^  l/2|Qfc)P(Qfc|Qj) 

+  P((a,  y)  :  ^(a)  7^  y  and  77(a)  ^  1/2  and  x  £  Qfc|a  e  Q j) 

-  P((x,  y)  :  h,(x)  7  y  and  y(x)  7  l/2|z  e  Q,-)  +  2^204MHl024d) +.ln(32(rf  +  lj/j) 

<  P(Qfc|QJ)P((x,  ?/)  :  hfe(a)  7^  y  and  77(a)  ^  l/2|Qfc) 

+  P((a,  y)  :  hk{x)  7^  ?/  and  77(a)  7^  1/2  and  x  £  Qk\x  e  Qj) 

-  P((x,  y)  :  h\x)  7  y  and  V(x)  7  l/2|x  e  Q,-)  +  2^/204Mln(1024ti) +  M32(rf+ !)/j) 

=  P(Qfe|Qj)(er(^fc|Qfc)  -  er(/i*|Qfe)) 

+  P((a,  7/)  :  hk{x)  7^  y  and  77(a)  7^  1/2  and  a  Qfc|a  E  Qj) 


—  P((a,  y)  :  h*(x)  7^  y  and  77(a)  7^  1/2  and  a  </  Qfc|a  e  Qj) 
/ 2048c?  In M  024d0  4-  In r32lr/  4-  1 1  Ml 

+  2\ 


L2t7/(3  •  2>')J 


< 


'  2048c?  ln(  1024c?)  +  ln(32(d  +  1  )/8) 


\_2n/ (3  ■  2fe)J 

+  P(a  :  hk{x)  7^  h*(x)  and  77(a)  7^  1/2  and  a  Qfc)/P(Qj) 


+  2\ 


1 2048c?  ln(1024c?)  +  ln(32(d  +  l)/8) 


\_2n/ (3  •  2*)J 


<  44 


1 2048c? ln(1024c?)  +  ln(32(o?  +  1)/S)  ,  fJ  ,  ^  _c,v/3 

[2^7(372*)]  +(d+  ^ 
In  particular,  this  implies 


;)  <  (d  +  l)e 


-r”n1/3 


[2n/(3  •  2fc+1)J 


20485  ln(1024d)  +  ln(32(d  +  1) /5) ' 


Therefore, 


er(h)  ~v<  P(Qfc)21 


' 20485 ln(10245)  +  ln(32(5  +  l)/5)  _c„ni/3 

l>/P  •  2*)j  +(  +  >e 


<  (1  +  V2)(d  +  l)e~c"nl/3  =  o(n~1/2). 


□ 

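Proof of Theorem 4.5. This result now follows directly from Lemma 4.18. That is, for sufficiently
large $n$ (say $n > s$, for some $s \in \mathbb{N}$), $\mathbb{P}(\tilde{H}_n^c) \leq \delta/2$, so with probability $\geq 1 - \delta$,
$er(\hat{h}) - \nu \leq \mathcal{E}_n$. We can define $\mathcal{E}'_n = 1$ for $n \leq s$, and $\mathcal{E}'_n = \mathcal{E}_n$ for $n > s$. Then we have for
all $n$, with probability $\geq 1 - \delta$, $er(\hat{h}) - \nu \leq \mathcal{E}'_n = o(n^{-1/2})$. Thus, the algorithm obtains a label
complexity

\[
\Lambda_a(\epsilon + \nu, \delta, \mathcal{D}_{XY}) \leq 1 + \sup_{n \in \mathbb{N}} n\,\mathbb{1}[\mathcal{E}'_n > \epsilon].
\]

Now define $\mathcal{E}''_n = \mathcal{E}'_n + 2^{-n} = o(n^{-1/2})$. Then

\[
\begin{aligned}
\lim_{\epsilon \to 0} \epsilon^2 \Lambda_a(\epsilon + \nu, \delta, \mathcal{D}_{XY})
&\leq \lim_{\epsilon \to 0} \epsilon^2 \Big(1 + \sup_{n \in \mathbb{N}} n\,\mathbb{1}[\mathcal{E}''_n > \epsilon]\Big) \\
&= \lim_{\epsilon \to 0} \epsilon^2 \sup_{n \in \mathbb{N},\, n \geq \lfloor \log_2(1/\epsilon) \rfloor} n\,\mathbb{1}[\mathcal{E}''_n > \epsilon] \\
&\leq \lim_{\epsilon \to 0} \epsilon^2 \sup_{n \in \mathbb{N},\, n \geq \lfloor \log_2(1/\epsilon) \rfloor} n\,\frac{(\mathcal{E}''_n)^2}{\epsilon^2} \\
&= \lim_{\epsilon \to 0} \sup_{n \in \mathbb{N},\, n \geq \lfloor \log_2(1/\epsilon) \rfloor} n(\mathcal{E}''_n)^2 \\
&= \limsup_{n \to \infty} n(\mathcal{E}''_n)^2 = \Big(\limsup_{n \to \infty} \sqrt{n}\,\mathcal{E}''_n\Big)^2 = 0.
\end{aligned}
\]

Therefore, $\Lambda_a(\epsilon + \nu, \delta, \mathcal{D}_{XY}) = o(1/\epsilon^2)$, as claimed. □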
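
To make the last step concrete, the following Python sketch evaluates the bound 1 + sup{n : E''_n > eps} for one particular o(n^{-1/2}) sequence and checks numerically that eps^2 times the bound shrinks as eps decreases. The sequence E''_n = 1/(sqrt(n) ln(n+2)) + 2^{-n} is an arbitrary illustrative choice, not one taken from the analysis above, and the function names are hypothetical.

```python
import math


def excess_error_bound(n: int) -> float:
    """Illustrative o(n^{-1/2}) sequence: 1/(sqrt(n) ln(n+2)) + 2^{-n} (an assumption)."""
    tail = 2.0 ** -n if n < 1000 else 0.0  # 2^{-n} is negligible for large n
    return 1.0 / (math.sqrt(n) * math.log(n + 2)) + tail


def label_complexity_bound(eps: float) -> int:
    """Return 1 + sup{n : excess_error_bound(n) > eps}.

    Relies on the sequence being decreasing in n, so the supremum can be
    located by doubling followed by binary search.
    """
    if excess_error_bound(1) <= eps:
        return 1  # the set is empty, so the supremum term contributes nothing
    lo, hi = 1, 2
    while excess_error_bound(hi) > eps:
        lo, hi = hi, 2 * hi
    # Invariant: excess_error_bound(lo) > eps >= excess_error_bound(hi).
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if excess_error_bound(mid) > eps:
            lo = mid
        else:
            hi = mid
    return 1 + lo


if __name__ == "__main__":
    for eps in (1e-1, 1e-2, 1e-3):
        bound = label_complexity_bound(eps)
        print(eps, bound, eps ** 2 * bound)
```

The product eps^2 times the bound decreases along this sequence of eps values, which is the finite-sample face of the o(1/eps^2) claim; a sequence that is only O(n^{-1/2}) would instead leave the product bounded away from zero.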
Chapter 5


Beyond  Label  Requests:  A  General 
Framework  for  Interactive  Statistical 
Learning 


In this chapter, I describe a general framework in which a learning algorithm is tasked with learning
some concept from a known class by interacting with a teacher via questions. Each question
has an arbitrary known cost associated with it, which the learner is required to pay in order to
have the question answered. Exploring the information-theoretic limits of this framework, I define
a notion called the cost complexity of learning, analogous to traditional notions of sample
complexity. I discuss this topic for the Exact Learning setting as well as PAC Learning with a
pool of unlabeled examples. In the former case, the learner is allowed to ask any question, while
in the latter case, all questions must concern the target concept's behavior on a set of unlabeled
examples. In both settings, I derive upper and lower bounds on the cost complexity of learning,
based on a combinatorial quantity I call the General Identification Cost.
5.1 Introduction


The  ability  to  ask  questions  to  a  knowledgeable  teacher  can  make  learning  easier.  This  fact  is  no 
secret  to  any  elementary  school  student.  But  how  much  easier?  Some  questions  are  more  difficult 
for  the  teacher  to  answer  than  others.  How  much  inconvenience  must  even  the  most  conscientious 
learner  cause  to  a  teacher  in  order  to  learn  a  concept?  This  chapter  explores  these  and  related 
questions  about  the  fundamental  advantages  and  limitations  of  learning  by  interaction. 

In machine learning research, it is becoming increasingly apparent that well-designed interactive
learning algorithms can provide valuable improvements in learning performance while
reducing the amount of effort required of a human annotator. This research has mainly focused
on two formal settings of learning: Exact Learning by queries and pool-based Active PAC Learning.
Informally, the objective in the setting of Exact Learning by queries is to perfectly identify
a target concept (classifier) by asking questions. In contrast, the pool-based Active PAC setting
is concerned only with approximating the concept with high probability with respect to an unknown
distribution on the set of possible instances. In this latter setting, the learning algorithm
is restricted to asking only questions that relate to the concept's behavior on a particular set of
unannotated instances drawn independently from the unknown distribution.

In this chapter, I study both of these active learning settings under a broad definition. Specifically,
I consider a learning protocol in which the learner can ask any question, but each possible
question has an associated cost. For example, a query of the form "what is the label of example
x" might cost $1, while a query of the form "show me a positive example" might cost $10. The
objective is to learn the concept while minimizing the total cost of queries made. One would like
to know how much cost even the most clever learner might be required to pay to learn a concept
from a particular concept space in the worst case. This can be viewed as a generalization of
notions of sample complexity or query complexity found in the learning theory literature. I refer
to this best worst-case cost as the cost complexity of learning. This quantity is defined without
reference to computational feasibility, focusing instead on the information-theoretic boundaries
of this setting (in the limit of unbounded computation). Below, I derive bounds on the cost
complexity of learning, as a function of the concept space and cost function, for both Exact Learning
from queries and pool-based Active PAC Learning.


Section 5.2 formally introduces the setting of Exact Learning from queries, describes some
related work, and defines cost complexity for that setting. It also serves to introduce the notation
and fundamental definitions used throughout this chapter. The section closely parallels the work
of Balcázar et al. [Balcázar et al., 2001]. The primary contribution of Section 5.2 is a derivation
of upper and lower bounds on the cost complexity of Exact Learning from queries. This is
followed, in Section 5.3, by a formal definition of pool-based Active PAC Learning and extension
of the notion of cost complexity to that setting. The primary contributions of Section 5.3 include
a derivation of upper and lower bounds on the cost complexity of learning in that general setting,
as well as an interesting corollary for intersection-closed concept spaces. I know of no previous
work giving general results of this type.


5.2  Active  Exact  Learning 


In this setting, there is an instance space $X$ and concept space $C$ on $X$ such that any $h \in C$ is
a distinct function $h : X \to \{0, 1\}$.^1 Additionally, define $C^* = \{h : X \to \{0, 1\}\}$. That is,
$C^*$ is the most general concept space, containing all possible labelings of $X$. In particular, any
concept space $C$ is a subset of $C^*$. For a particular learning problem, there is an unknown target
concept $f \in C$, and the task is to identify $f$ using a teacher’s answers to queries made by the
learning algorithm. Formally, an actual query is any function in
$\tilde{Q} = \{\tilde{q} : C^* \to 2^{A^*} \setminus \{\emptyset\}\}$,^2
for some answer set $A^*$. By a learning algorithm “making an actual query”, I mean that it selects
a function $\tilde{q} \in \tilde{Q}$, passes it to the teacher, and the teacher returns a single answer
$\tilde{a} \in \tilde{q}(f)$, where $f$ is the target concept. A concept $h \in C^*$ is consistent with an
answer $\tilde{a}$ to an actual query $\tilde{q}$ if $\tilde{a} \in \tilde{q}(h)$. Thus, I assume the teacher
always returns an answer that the target concept is consistent with; however, when there are multiple
such answers, the teacher may arbitrarily select from amongst them.

^1 All of the main results easily generalize to multiclass as well.

^2 The restriction that $\tilde{q}(f) \neq \emptyset$ is a bit like an assumption that every valid question has at least one answer for
any target concept. However, we can always define some particular answer to mean “there is no answer,” so this
restriction is really more of a notational convenience than an assumption.

Traditionally,  the  subject  of  active  learning  has  been  studied  with  respect  to  specific  restricted 
query  types,  such  as  membership  queries,  and  the  learning  algorithm’s  objective  has  been  to 
minimize  the  number  of  queries  used  to  learn.  However,  it  is  often  the  case  that  learning  with 
these  simple  types  of  queries  is  difficult,  but  if  the  learning  algorithm  is  allowed  just  a  few  special 
queries,  learning  becomes  significantly  easier.  The  reason  we  are  initially  reluctant  to  allow  the 
learner  to  ask  certain  types  of  queries  is  that  these  queries  are  difficult,  expensive,  or  sometimes 
impossible  to  answer.  However,  we  can  incorporate  this  difficulty  level  into  the  framework  by 
assigning each query type a specific cost, and then allowing the learning algorithm to explicitly
optimize  the  cost  needed  to  learn,  rather  than  the  number  of  queries.  In  addition  to  allowing  the 
algorithm  to  trade  off  between  different  types  of  queries,  this  also  gives  us  the  added  flexibility  to 
specify  different  costs  within  the  same  family  (e.g.,  perhaps  some  membership  queries  are  more 
expensive  than  others). 

Formally, in this framework there is a cost function. Let $\alpha > 0$ be a constant. A cost
function is any $c : \tilde{Q} \to (\alpha, \infty]$. In practice, $c$ would typically be defined by the user responsible
for answering the queries, and could be based on the time, resources, or operating expenses
necessary to obtain the answer. Note that if a particular type of query is unanswerable for a
particular application, or if the user wishes to work with a reduced set of possible queries, one
can always define the costs of those undesirable query types to be $\infty$, so that any reasonable
learning algorithm ignores them if possible.

While the notion of actual query closely corresponds to the actual mechanism of querying in
practice, it will be more convenient to work with the information-theoretic implications of these
queries. Define the set of effective queries

$$Q = \{q : C^* \to 2^{2^{C^*}} \setminus \{\emptyset\} \mid \forall f \in C^*,\ a \in q(f) \Rightarrow [f \in a \wedge \forall h \in a,\ a \in q(h)]\}.$$

Each effective query corresponds to an equivalence class of actual
queries, defined by mapping any answer to the set of concepts consistent with it. We can thus
define the mapping

$$\mathcal{E}(q) = \{\tilde{q} \mid \tilde{q} \in \tilde{Q},\ \forall f \in C^*,\ [\exists \tilde{a} \in \tilde{q}(f) \text{ with } a = \{h \mid h \in C^*,\ \tilde{a} \in \tilde{q}(h)\}] \Leftrightarrow a \in q(f)\}.$$

By an algorithm “making an effective query $q$,” I mean that it makes an actual query in $\mathcal{E}(q)$^3 (a
good algorithm will pick a cheaper actual query). For the purpose of this best-worst-case
analysis, the following definition is appropriate. For a cost function $c$, define a corresponding
effective cost function (overloading notation) $c : Q \to [\alpha, \infty]$, such that
$\forall q \in Q$, $c(q) = \inf_{\tilde{q} \in \mathcal{E}(q)} c(\tilde{q})$. The following definitions illustrate how query types can be
defined using effective queries.

A positive example query is any $\tilde{q} \in \mathcal{E}(q_S)$ for some $S \subseteq X$, such that $q_S \in Q$ is defined by
$\forall f \in C^*$ s.t. $[\exists x \in S : f(x) = 1]$, $q_S(f) = \{\{h \mid h \in C^*, h(x) = 1\} \mid x \in S : f(x) = 1\}$, and
$\forall f \in C^*$ s.t. $[\forall x \in S, f(x) = 0]$, $q_S(f) = \{\{h \mid h \in C^* : \forall x \in S, h(x) = 0\}\}$.

A membership query is any $\tilde{q} \in \mathcal{E}(q_{\{x\}})$ for some $x \in X$. This special case of a positive
example query can equivalently be defined by $\forall f \in C^*$, $q_{\{x\}}(f) = \{\{h \mid h \in C^*, h(x) = f(x)\}\}$.
These effectively correspond to asking for any example labeled 1 in $S$ or an indication that there
are none (positive example query), and asking for the label of a particular example in $X$
(membership query). I will refer to these two query types in subsequent examples, but the
reader should keep in mind that the theorems below apply to all types of queries.
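As an illustration only, the two query types can be written out for a tiny finite instance space. The sketch below is not part of the original development: it assumes $X = \{0, \ldots, n-1\}$, represents each concept in $C^*$ as a tuple of labels, and uses hypothetical helper names (`all_labelings`, `membership_query`, `positive_example_query`, `is_effective_query`). The last helper checks the defining property of an effective query by brute force.

```python
from itertools import product

def all_labelings(n):
    """C*: every labeling of X = {0, ..., n-1}, each stored as a tuple of 0/1."""
    return [tuple(bits) for bits in product((0, 1), repeat=n)]

def membership_query(x, c_star):
    """Effective membership query on x: the single answer is the set of
    concepts in C* that agree with the target f on x."""
    def q(f):
        return {frozenset(h for h in c_star if h[x] == f[x])}
    return q

def positive_example_query(s, c_star):
    """Effective positive example query on the set of points s."""
    def q(f):
        positives = [x for x in s if f[x] == 1]
        if positives:
            # One valid answer per positive point x in s.
            return {frozenset(h for h in c_star if h[x] == 1) for x in positives}
        # Otherwise the only answer says "no point of s is labeled 1".
        return {frozenset(h for h in c_star if all(h[x] == 0 for x in s))}
    return q

def is_effective_query(q, c_star):
    """Check: every q(f) is nonempty, and a in q(f) implies f in a
    and a in q(h) for every h in a."""
    for f in c_star:
        answers = q(f)
        if not answers:
            return False
        for a in answers:
            if f not in a or any(a not in q(h) for h in a):
                return False
    return True
```

Under these assumptions, a positive example query on a single point behaves exactly like the membership query on that point, which matches the remark that membership queries are a special case.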

Additionally, it will be useful to have a notion of an effective oracle, which is an unknown
function defining how the teacher will answer the various queries. Formally, an effective oracle
$T$ is any function in $\mathcal{T} = \{T : Q \to 2^{C^*} \mid \forall q \in Q,\ T(q) \in \bigcup_{f \in C^*} q(f)\}$.^4 For convenience, I also
overload this notation, defining for a set of queries $R \subseteq Q$, $T(R) = \bigcap_{q \in R} T(q)$.

^3 I assume $A^*$ is sufficiently expressive so that $\forall q \in Q$, $\mathcal{E}(q) \neq \emptyset$; alternatively, we could define
$\mathcal{E}(q) = \emptyset \Rightarrow c(q) = \infty$ without sacrificing the main theorems. Additionally, I will assume that it is possible
to find an actual query in $\mathcal{E}(q)$ with cost arbitrarily close to $\inf_{\tilde{q} \in \mathcal{E}(q)} c(\tilde{q})$ for any $q \in Q$ using finite computation.

^4 An effective oracle corresponds to a deterministic stateless teacher, which gives up as little information as
possible. It is also possible to analyze a setting in which asking two queries from the same equivalence class, or
asking the same question twice, can possibly lead to two different answers. However, the worst case in both settings
is identical, so the worst case results obtained for this setting also apply to the more general case.

Definition 5.1. A learning algorithm $A$ for $C$ using cost function $c$ is any algorithm which, for
any (unknown) target concept $f \in C$, by a finite number of finite cost actual queries, is
guaranteed to reduce the set of concepts in $C$ consistent with the answers to precisely $\{f\}$.^5 A
concept space $C$ is learnable with cost function $c$ using total cost $t$ if there exists a learning
algorithm for $C$ using $c$ guaranteed to have the sum of costs of the queries it makes at most $t$.

^5 I have made the dependence of $A$ on the teacher implicit. To be formally correct, $A$ should have the teacher’s
effective oracle $T$ as input, and is guaranteed to output $f$ for any $T \in \mathcal{T}$ s.t. $\forall q \in Q$, $T(q) \in q(f)$. Cost is then a
book-keeping device recording how $A$ uses $T$ during execution.

Definition 5.2. For any instance space $X$, concept space $C$ on $X$, and cost function $c$, define
the cost complexity, denoted $\mathrm{CostComplexity}(C, c)$, as the infimum $t \ge 0$ such that $C$ is
learnable with cost function $c$ using total cost no greater than $t$.

Equivalently, we can define cost complexity using the following recurrence. If $|C| = 1$,
$\mathrm{CostComplexity}(C, c) = 0$. Otherwise,

$$\mathrm{CostComplexity}(C, c) = \inf_{\tilde{q} \in \tilde{Q}} \Big[ c(\tilde{q}) + \max_{f \in C,\, \tilde{a} \in \tilde{q}(f)} \mathrm{CostComplexity}(\{h \mid h \in C,\ \tilde{a} \in \tilde{q}(h)\}, c) \Big].$$

Since

$$\inf_{\tilde{q} \in \tilde{Q}} \Big[ c(\tilde{q}) + \max_{f \in C,\, \tilde{a} \in \tilde{q}(f)} \mathrm{CostComplexity}(\{h \mid h \in C,\ \tilde{a} \in \tilde{q}(h)\}, c) \Big]$$
$$= \inf_{q \in Q} \inf_{\tilde{q} \in \mathcal{E}(q)} \Big[ c(\tilde{q}) + \max_{f \in C,\, \tilde{a} \in \tilde{q}(f)} \mathrm{CostComplexity}(C \cap \{h \mid h \in C^*,\ \tilde{a} \in \tilde{q}(h)\}, c) \Big]$$
$$= \inf_{q \in Q} \Big[ c(q) + \max_{f \in C,\, a \in q(f)} \mathrm{CostComplexity}(C \cap a, c) \Big],$$

we can equivalently define cost complexity in terms of effective queries and effective cost. That
is, $\mathrm{CostComplexity}(C, c)$ is the infimum $t \ge 0$ such that there is an algorithm guaranteed to
identify any $f \in C$ using effective queries with total of effective costs no greater than $t$.
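For a finite concept space and a finite list of finite-cost queries, the recurrence for cost complexity can be evaluated directly. The following is a minimal sketch under those assumptions, not an algorithm from the original text: concepts are label tuples, a query is a hypothetical `(cost, q)` pair where `q(f)` returns the set of answers the teacher may give for target `f`, and the helper names are illustrative. A query for which some answer leaves the version space unchanged cannot attain the infimum (its cost is positive), so the sketch assigns it infinite value.

```python
from functools import lru_cache
import math

def membership(x):
    """Actual membership query on point x: the answer reveals the label of x."""
    return lambda f: {("label", x, f[x])}

def positive_example(points):
    """Actual positive example query: any positive point, or 'none'."""
    def q(f):
        positives = {("pos", x) for x in points if f[x] == 1}
        return positives or {("none",)}
    return q

def cost_complexity(concepts, queries):
    """Evaluate the recurrence by memoized recursion over version spaces."""
    @lru_cache(maxsize=None)
    def solve(version_space):
        if len(version_space) <= 1:
            return 0.0
        best = math.inf
        for cost, q in queries:
            worst = 0.0
            # Every answer the teacher could give for some remaining target.
            for a in set().union(*(q(f) for f in version_space)):
                remaining = frozenset(h for h in version_space if a in q(h))
                if remaining == version_space:
                    worst = math.inf  # no progress on this answer
                    break
                worst = max(worst, solve(remaining))
            best = min(best, cost + worst)
        return best
    return solve(frozenset(concepts))
```

For instance, with $n$ singleton concepts and unit-cost membership queries, an adversarial teacher can answer 0 until one concept remains, so the recurrence should evaluate to $n - 1$; adding a cheaper positive example query lowers the value to that query's cost.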
5.2.1 Related Work


There  have  been  a  relatively  large  number  of  contributions  to  the  study  of  Exact  Learning  from 
queries.  In  particular,  much  interest  has  been  given  to  settings  in  which  the  learning  algorithm  is 
restricted  to  a  few  specific  types  of  queries  (e.g.  membership  queries  and  equivalence  queries). 
However,  these  contributions  focus  entirely  on  the  number  of  queries  needed,  rather  than  cost. 


The most relevant work in this area is by Balcázar, Castro, and Guijarro [Balcázar et al., 2001].
Prior to publication of [Balcázar and Castro, 2002], there were a variety of publications in
which the learning algorithm could use some specific set of queries, and which derived bounds
on the number of queries any algorithm might be required to make in the worst case in order to
learn. For example, [Hellerstein et al., 1996] analyzed the combination of membership and
proper equivalence queries, [Hegedüs, 1995] additionally analyzed learning from membership
queries alone, while [Balcázar et al., 1999] considered learning from just proper equivalence
queries. Amidst these various special case analyses, somewhat surprisingly, Balcázar et al.
[Balcázar and Castro, 2002] discovered that the query complexity bounds derived in these
works were all special cases of a single general theorem, applying to the broad class of
sample-based queries. They further generalized this result in [Balcázar et al., 2001], giving
results that apply to any combination of any query types. That work defines an abstract
combinatorial quantity, which they call the General Dimension, which provides a lower bound
on the query complexity, and is within a log factor of it. Furthermore, the General Dimension
can actually be computed for a variety of interesting combinations of query types. Until now
there has not been any analysis I know of that considers learning with all query types, but giving
each query a cost, and bounding the worst-case cost that a learning algorithm might be required
to incur. In particular, the analysis of the next subsection can be viewed as a generalization of
[Balcázar et al., 2001] to add this notion of cost, such that [Balcázar et al., 2001] represents the
special case of cost that is uniformly 1 on a particular set of queries and $\infty$ on all other queries.
5.2.2 Cost Complexity Bounds


I now turn to the subject of exploring the fundamental limits of interactive learning in terms of
cost. This discussion closely parallels that of Balcázar, Castro, and Guijarro [Balcázar et al.,
2001].


Definition 5.3. For any instance space $X$, concept space $C$ on $X$, and cost function $c$, define
the General Identification Cost, denoted $\mathrm{GIC}(C, c)$, as follows.

$$\mathrm{GIC}(C, c) = \inf\Big\{t \;\Big|\; t \ge 0,\ \forall T \in \mathcal{T},\ \exists R \subseteq Q \text{ s.t. } \Big[\sum_{q \in R} c(q) \le t\Big] \wedge \big[|C \cap T(R)| \le 1\big]\Big\}$$

We can also express this as $\mathrm{GIC}(C, c) = \sup_{T \in \mathcal{T}} \inf_{R \subseteq Q : |C \cap T(R)| \le 1} \sum_{q \in R} c(q)$. Note that
calculating this corresponds to a much simpler optimization problem than calculating the cost
complexity. The General Identification Cost is a direct generalization of the General Dimension
of [Balcázar et al., 2001], which itself generalizes quantities such as Extended Teaching
Dimension [Hegedüs, 1995], Strong Consistency Dimension [Balcázar et al., 1999], and the
Certificate Sizes of [Hellerstein et al., 1996]. It can be interpreted as a sort of game. This game
is similar to the usual setting, except that the teacher’s answers are not restricted to be consistent
with a concept. Imagine there is a helpful spy who knows precisely how the teacher will
respond to every query. The spy is able to suggest queries to the learner, and wishes to cause the
learner to pay as little as possible. If the spy is sufficiently clever at suggesting queries, and the
learner follows every suggestion by the spy, then after asking some minimal cost set of queries
the learner can narrow the set of concepts in $C$ consistent with the answers down to at most one.
The General Identification Cost is precisely the worst case limiting cost the learner might be
forced to pay during this process, no matter how clever the spy is at suggesting queries.
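For very small problems, the sup-inf form of the General Identification Cost can be evaluated by exhaustive search over oracles and query sets. The sketch below is illustrative only and rests on several assumptions that are not part of the definition: the concept space is the set of intervals on $X = \{0, \ldots, n-1\}$, membership queries cost 1, a single positive example query on all of $X$ costs `k`, and every other query has infinite cost and is therefore omitted. The function names are hypothetical.

```python
from itertools import combinations, product

def intervals(n):
    """Concept space C: label tuples of the intervals [a, b] on X = {0, ..., n-1}."""
    return [tuple(1 if a <= x <= b else 0 for x in range(n))
            for a in range(n) for b in range(a, n)]

def gic_intervals(n, k):
    """Brute-force GIC(C, c): sup over oracles T of the cheapest query set R
    with |C ∩ T(R)| <= 1."""
    C = frozenset(intervals(n))
    # Each possible answer is stored as its intersection with C.
    member_answers = [[frozenset(h for h in C if h[x] == y) for y in (0, 1)]
                      for x in range(n)]
    pos_answers = [frozenset(h for h in C if h[x] == 1) for x in range(n)]
    pos_answers.append(frozenset())  # "no positive point": consistent with no interval
    costs = [1] * n + [k]
    gic = 0
    # An oracle fixes one answer for every finite-cost query (the spy game).
    for oracle in product(*member_answers, pos_answers):
        best = float("inf")
        for size in range(1, len(oracle) + 1):
            for R in combinations(range(len(oracle)), size):
                cost = sum(costs[i] for i in R)
                if cost < best and len(C.intersection(*(oracle[i] for i in R))) <= 1:
                    best = cost
        gic = max(gic, best)
    return gic
```

The search is exponential in the number of queries, so it is only meant as a way to sanity-check hand calculations on toy instances.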

Lemma 5.4. For any instance space $X$, concept space $C$ on $X$, and cost function $c$, if $V \subseteq C$,
then $\mathrm{GIC}(V, c) \le \mathrm{GIC}(C, c)$.


Proof. It clearly holds if $\mathrm{GIC}(C, c) = \infty$. If $\mathrm{GIC}(C, c) < k$, then $\forall T \in \mathcal{T}$, $\exists R \subseteq Q$ s.t.
$\sum_{q \in R} c(q) < k$ and $1 \ge |C \cap T(R)| \ge |V \cap T(R)|$, and therefore $\mathrm{GIC}(V, c) < k$. The limit as
$k \to \mathrm{GIC}(C, c)$ gives the result. □
Lemma 5.5. For any $\gamma > 0$, instance space $X$, finite concept space $C$ on $X$ with $|C| > 1$, and
cost function $c$ such that $\mathrm{GIC}(C, c) < \infty$, $\exists q \in Q$ such that $\forall T \in \mathcal{T}$,

$$|C \setminus T(q)| \ge c(q) \frac{|C| - 1}{\mathrm{GIC}(C, c) + \gamma}.$$

That is, regardless of which answer the teacher picks, there are at least $c(q) \frac{|C| - 1}{\mathrm{GIC}(C, c) + \gamma}$ concepts
in $C$ inconsistent with the answer.

Proof. Suppose $\forall q \in Q$, $\exists T_q \in \mathcal{T}$ such that $|C \setminus T_q(q)| < c(q) \frac{|C| - 1}{\mathrm{GIC}(C, c) + \gamma}$. Then define an
effective oracle $T$ with the property that $\forall q \in Q$, $T(q) = T_q(q)$. We have thus defined an oracle
such that $\forall R \subseteq Q$, $\sum_{q \in R} c(q) \le \mathrm{GIC}(C, c) + \gamma \Rightarrow$

$$|C \cap T(R)| = |C| - |C \setminus T(R)| \ge |C| - \sum_{q \in R} |C \setminus T_q(q)|$$
$$> |C| - \sum_{q \in R} c(q) \frac{|C| - 1}{\mathrm{GIC}(C, c) + \gamma}$$
$$\ge |C| - (\mathrm{GIC}(C, c) + \gamma) \frac{|C| - 1}{\mathrm{GIC}(C, c) + \gamma} = 1.$$

In particular, this contradicts the definition of $\mathrm{GIC}(C, c)$. □


This  brings  us  to  the  main  theorem  of  this  section. 

Theorem 5.6. For any instance space $X$, concept space $C$ on $X$, and cost function $c$,

$$\mathrm{GIC}(C, c) \le \mathrm{CostComplexity}(C, c) \le \mathrm{GIC}(C, c) \log_2 |C|.$$

Proof. I begin with the lower bound. Let $k < \mathrm{GIC}(C, c)$. By definition of GIC, $\exists T \in \mathcal{T}$ such
that $\forall R \subseteq Q$, $\sum_{q \in R} c(q) \le k \Rightarrow |C \cap T(R)| > 1$. In particular, this implies that an adversarial
teacher can answer any sequence of queries with cost no greater than $k$ in a way that leaves at
least 2 concepts in $C$ consistent with the answers, either of which could be the target concept $f$.
This implies $\mathrm{CostComplexity}(C, c) \ge k$. The limit as $k \to \mathrm{GIC}(C, c)$ gives the bound.

Next I prove the upper bound. If $\mathrm{GIC}(C, c) = \infty$ or $|C| = \infty$, the bound holds vacuously, so
let us assume these are finite. Say the teacher’s answers correspond to some effective oracle
$T \in \mathcal{T}$. Consider a recursive algorithm $A_\gamma$ that makes effective queries from $Q$.^6 If $|C| = 1$,
then $A_\gamma$ halts and outputs the single remaining concept. Otherwise, let $q$ be an effective query
having the property guaranteed by Lemma 5.5. That is, $|C \setminus T(q)| \ge c(q) \frac{|C| - 1}{\mathrm{GIC}(C, c) + \gamma}$. Defining
$V = C \cap T(q)$ (a generalized notion of version space), this implies that
$c(q) \le (\mathrm{GIC}(C, c) + \gamma) \frac{|C| - |V|}{|C| - 1}$ and $|V| < |C|$. Say $A_\gamma$ makes effective query $q$, and then
recurses on $V$. In particular, we can immediately see that this algorithm identifies $f$ using no
more than $|C| - 1$ queries.

I now prove by induction on $|C|$ that $\mathrm{CostComplexity}(C, c) \le (\mathrm{GIC}(C, c) + \gamma) H_{|C|-1}$, where
$H_n = \sum_{i=1}^{n} \frac{1}{i}$ is the $n$th harmonic number. If $|C| = 1$, then the cost complexity is 0. For
$|C| > 1$,

$$\mathrm{CostComplexity}(C, c) \le c(q) + \mathrm{CostComplexity}(V, c)$$
$$\le (\mathrm{GIC}(C, c) + \gamma) \frac{|C| - |V|}{|C| - 1} + (\mathrm{GIC}(V, c) + \gamma) H_{|V|-1}$$
$$\le (\mathrm{GIC}(C, c) + \gamma) \Big( \frac{|C| - |V|}{|C| - 1} + H_{|V|-1} \Big)$$
$$\le (\mathrm{GIC}(C, c) + \gamma) H_{|C|-1},$$

where the second inequality uses the inductive hypothesis along with the properties of $q$
guaranteed by Lemma 5.5, and the third inequality uses Lemma 5.4. Finally, noting that
$H_{|C|-1} \le \log_2 |C|$ and taking the limit as $\gamma \to 0$ proves the theorem. □
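The last inequality of the chain is stated without justification. One way to see it, offered here as a short sketch that uses only $1 \le |V| < |C|$: the difference of the two harmonic numbers is a sum of $|C| - |V|$ terms, each of which is at least $\frac{1}{|C| - 1}$.

```latex
H_{|C|-1} - H_{|V|-1}
  \;=\; \sum_{i=|V|}^{|C|-1} \frac{1}{i}
  \;\ge\; \frac{|C| - |V|}{|C| - 1}
\qquad\Longrightarrow\qquad
\frac{|C| - |V|}{|C| - 1} + H_{|V|-1} \;\le\; H_{|C|-1}.
```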


One interesting implication of this proof is that the greedy algorithm that chooses $q$ to maximize
$\min_{T \in \mathcal{T}} \frac{|C \setminus T(q)|}{c(q)}$ has a cost complexity within a $\log_2 |C|$ factor of optimal.

^6 I use the definition of cost complexity in terms of effective cost, so that we need not concern ourselves with
how $A_\gamma$ chooses its actual queries. However, we could define $A_\gamma$ to make actual queries with cost within $\gamma$ of the
effective query cost, so that the result still holds as $\gamma \to 0$.
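The greedy rule suggested by the proof can be sketched for a finite problem as follows. This is an illustration under stated assumptions rather than the construction used in the proof: concepts are intervals on $X = \{0, \ldots, n-1\}$, each query is a hypothetical `(cost, answers)` pair listing every answer an oracle could give, and the simulated teacher simply returns the first answer consistent with the target. All names are illustrative.

```python
import math

def intervals(n):
    """Label tuples of the intervals [a, b] on X = {0, ..., n-1}."""
    return [tuple(1 if a <= x <= b else 0 for x in range(n))
            for a in range(n) for b in range(a, n)]

def consistent(h, answer):
    """True if concept h is consistent with the given answer label."""
    if answer[0] == "label":       # membership answer: ("label", x, y)
        _, x, y = answer
        return h[x] == y
    if answer[0] == "pos":         # positive example answer: ("pos", x)
        return h[answer[1]] == 1
    return not any(h)              # ("none",): no positive point at all

def greedy_learn(concepts, queries, target):
    """Repeatedly ask the query maximizing (worst-case concepts removed) / cost."""
    version_space = set(concepts)
    total = 0.0
    while len(version_space) > 1:
        def score(query):
            cost, answers = query
            removed = min(sum(not consistent(h, a) for h in version_space)
                          for a in answers)
            return removed / cost
        best = max(queries, key=score)
        if score(best) == 0:
            raise ValueError("no query is guaranteed to make progress")
        cost, answers = best
        # Simulated teacher: first answer consistent with the target.
        answer = next(a for a in answers if consistent(target, a))
        version_space = {h for h in version_space if consistent(h, answer)}
        total += cost
    return version_space.pop(), total

def theorem_bound(gic, num_concepts):
    """Upper bound GIC * log2|C| from Theorem 5.6."""
    return gic * math.log2(num_concepts)
```

In this toy setting the total cost paid by the greedy learner can be compared against `theorem_bound`, with the value of GIC supplied by a separate calculation.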
5.2.3 An Example: Discrete Intervals


As a simple example of cost complexity, consider $X = \{1, 2, \ldots, N\}$, for $N > 4$,
$C = \{h_{a,b} : X \to \{0, 1\} \mid a, b \in X,\ a \le b,\ \forall x \in X,\ [a \le x \le b \Leftrightarrow h_{a,b}(x) = 1]\}$, and define an
effective cost function $c$ that is 1 for membership queries $q_{\{x\}}$ for any $x \in X$, $k$ for the positive
example query $q_X$ where $3 \le k < N - 1$, and $\infty$ for any other queries. In this case,
$\mathrm{GIC}(C, c) = k + 1$. In the spy game, say the teacher answers effective queries with an effective
oracle $T$. Let $X_+ = \{x \mid x \in X,\ T(q_{\{x\}}) = \{h \mid h \in C^*, h(x) = 1\}\}$. If $X_+ \neq \emptyset$, then let
$a = \min X_+$ and $b = \max X_+$. The spy tells the learner to make queries $q_{\{a\}}$, $q_{\{b\}}$, $q_{\{a-1\}}$ (if
$a > 1$), and $q_{\{b+1\}}$ (if $b < N$). This narrows the version space to $\{h_{a,b}\}$, at a worst-case effective
cost of 4. If $X_+ = \emptyset$, then the spy suggests query $q_X$. If $T(q_X) = \{f_-\}$, the “all 0” concept,
then no concepts in $C$ are consistent. Otherwise, $T(q_X) = \{h \mid h \in C^*, h(x) = 1\}$ for some
$x \in X$, and the spy suggests membership query $q_{\{x\}}$. In this case, $T(q_{\{x\}}) \cap T(q_X) = \emptyset$, so the
worst-case cost is $k + 1$ (without $q_X$, it would cost $N - 1$). These are the only cases to consider,
so $\mathrm{GIC}(C, c) = k + 1$. By Theorem 5.6, this implies
$k + 1 \le \mathrm{CostComplexity}(C, c) \le 2(k + 1) \log_2 N$.

We can slightly improve this by noting that we only use $q_X$ once. Specifically, if a learning
algorithm begins (in the regular setting) by asking $q_X$, revealing that $f(x) = 1$ for some $x \in X$,
then we can reduce to two disjoint learning problems, with concept spaces
$C'_1 = \{h_{x,b} \mid b \in \{x, \ldots, N\}\}$, and $C'_2 = \{h_{a,x} \mid a \in \{1, 2, \ldots, x\}\}$, with cost functions
$c_1(q) = c(q)$ for $q \in \{q_{\{x\}}, q_{\{x+1\}}, \ldots, q_{\{N\}}\}$ and $\infty$ otherwise, and $c_2(q) = c(q)$ for
$q \in \{q_{\{1\}}, q_{\{2\}}, \ldots, q_{\{x\}}\}$ and $\infty$ otherwise, and corresponding $\mathrm{GIC}(C'_1, c_1) \le 2$,
$\mathrm{GIC}(C'_2, c_2) \le 2$. So we can say that

$$\mathrm{CostComplexity}(C, c) \le k + \mathrm{CostComplexity}(C'_1, c_1) + \mathrm{CostComplexity}(C'_2, c_2) \le k + 4 \log_2 N.$$

One algorithm that achieves this begins by making the positive example query, and then
performs binary search above and below the indicated positive example to find the boundaries.
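The positive-example-then-binary-search strategy can be sketched as follows. This is a minimal illustration, not code from the original work: the two oracles are passed in as hypothetical callables (`label(x)` for a unit-cost membership query on $x \in \{1, \ldots, n\}$, and `positive_example()` for the cost-`k` positive example query on all of $X$), and the target is assumed to be a nonempty interval.

```python
def learn_interval(n, k, label, positive_example):
    """Identify the interval [a, b] on {1, ..., n}; returns (a, b, total_cost)."""
    cost = k
    x = positive_example()          # some point with f(x) = 1, at cost k

    # Right boundary: the largest point labeled 1 lies in {x, ..., n}.
    lo, hi = x, n
    while lo < hi:
        mid = (lo + hi + 1) // 2    # round up so the range always shrinks
        cost += 1
        if label(mid):
            lo = mid
        else:
            hi = mid - 1
    b = lo

    # Left boundary: the smallest point labeled 1 lies in {1, ..., x}.
    lo, hi = 1, x
    while lo < hi:
        mid = (lo + hi) // 2
        cost += 1
        if label(mid):
            hi = mid
        else:
            lo = mid + 1
    a = lo
    return a, b, cost
```

Each binary search halves its candidate range with every membership query, so under these assumptions the total cost stays within the $k + 4 \log_2 N$ bound stated above, whichever positive example the teacher chooses to reveal.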
5.3 Pool-Based Active PAC Learning


In many scenarios, a more realistic definition of learning is that supplied by the Probably
Approximately Correct (PAC) model. In this case, unlike the previous section, we are interested
only in discovering with high probability a function with behavior very similar to the target
concept on examples sampled from some distribution. Formally, as above there is an instance
space $X$, and a concept space $C \subseteq C^*$ on $X$; unlike above, there is also a distribution $\mathcal{D}$ over $X$.
As with Exact Learning, the learning algorithm interacts with a teacher by making queries.
However, in this setting the learning algorithm is given as input a finite sequence^7 of unlabeled
examples $U$, each drawn independently according to $\mathcal{D}$, and all queries made by the algorithm
must concern only the behavior of the target concept on examples in $U$. Formally, a
data-dependent cost function is any function $c : \tilde{Q} \times 2^{X} \to (\alpha, \infty]$. For a given set of unlabeled
examples $U$, and data-dependent cost function $c$, define $c_U(\cdot) = c(\cdot, U)$. Thus, $c_U$ is a cost
function in the sense of the previous section. For a given $c_U$, the corresponding effective cost
function $c_U : Q \to [\alpha, \infty]$ is defined as in the previous section.


Definition 5.7. Let $X$ be an instance space, $C$ a concept space on $X$, and $U = (x_1, x_2, \ldots, x_{|U|})$
a finite sequence of unlabeled examples. Define $\forall h \in C$, $h(U) = (h(x_1), h(x_2), \ldots, h(x_{|U|}))$.
Define $C[U] \subseteq C$ as any concept space such that $\forall h \in C$, $|\{h' \mid h' \in C[U],\ h'(U) = h(U)\}| = 1$.


^7 I will implicitly overload all notation for sets and sequences, so that if a set is used where a sequence is required,
then an arbitrary ordering of the set is implied (though this ordering should be used consistently), and if a sequence
is used where a set is required, then the set of distinct elements of the sequence is implied.
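To make the construction of $C[U]$ concrete, the sketch below keeps exactly one representative concept for each distinct labeling of $U$. It is an illustration only: concepts are assumed to be label tuples indexed by points of $X = \{0, \ldots, n-1\}$, and the function names are hypothetical. Which representative is kept is arbitrary, matching the "any concept space such that" wording of the definition.

```python
def labeling(h, U):
    """h(U): the labels that concept h assigns to the sequence U."""
    return tuple(h[x] for x in U)

def project(concepts, U):
    """C[U]: one representative from C for each distinct labeling of U."""
    representatives = {}
    for h in concepts:
        # Keep the first concept seen for each labeling of U.
        representatives.setdefault(labeling(h, U), h)
    return list(representatives.values())

def intervals(n):
    """Label tuples of the intervals [a, b] on X = {0, ..., n-1}."""
    return [tuple(1 if a <= x <= b else 0 for x in range(n))
            for a in range(n) for b in range(a, n)]
```

For example, projecting the intervals on six points onto $U = (1, 4)$ should leave four representatives, one for each labeling of the two points that some interval realizes.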
Definition 5.8. A sample-based cost function is any data-dependent cost function $c$ such that
for all finite $U \subseteq X$, $\forall q \in Q$,

$$c_U(q) < \infty \Rightarrow \forall f \in C^*,\ \forall a \in q(f),\ \forall h \in C^*,\ [h(U) = f(U) \Rightarrow h \in a].$$

This corresponds to queries that are about the target concept’s labels on some subset of $U$.
Additionally, $\forall U \subseteq X$, $x \in X$, and $q \in Q$, $c(q, U \cup \{x\}) \le c(q, U)$. That is, in addition to the
above property, adding extra examples to which $q$’s answers do not refer does not increase its
cost.

For example, membership queries on $x \in U$ and positive example queries on $S \subseteq U$ could
have finite costs under a sample-based cost function. As in the previous section, there is a target
concept $f \in C$, but unlike that section, we do not try to identify $f$, but instead attempt to
approximate it with high probability.

Definition 5.9. For instance space $X$, concept space $C$ on $X$, distribution $\mathcal{D}$ on $X$, target
concept $f \in C$, and concept $h \in C$, define the error rate of $h$, denoted $\mathrm{error}_{\mathcal{D}}(h, f)$, as

$$\mathrm{error}_{\mathcal{D}}(h, f) = \Pr_{X \sim \mathcal{D}}\{h(X) \neq f(X)\}.$$

Definition 5.10. For $(\epsilon, \delta) \in (0, 1)^2$, an $(\epsilon, \delta)$-learning algorithm for $C$ using sample-based cost
function $c$ is any algorithm $A$ taking as input a finite sequence of unlabeled examples, such that
for any target concept $f \in C$ and finite sequence $U$, $A(U)$ outputs a concept in $C$ after making
a finite number of actual queries with finite costs under $c_U$. Additionally, any $(\epsilon, \delta)$-learning
algorithm $A$ has the property that $\exists m \in [0, \infty)$ such that, for any target concept $f \in C$ and
distribution $\mathcal{D}$ on $X$,

$$\Pr_{U \sim \mathcal{D}^m}\{\mathrm{error}_{\mathcal{D}}(A(U), f) > \epsilon\} \le \delta.$$

A concept space $C$ is $(\epsilon, \delta)$-learnable given sample-based cost function $c$ using total cost $t$ if
there exists an $(\epsilon, \delta)$-learning algorithm $A$ for $C$ using $c$ such that for all finite example
sequences $U$, $A(U)$ is guaranteed to have the sum of costs of the queries it makes at most $t$
under $c_U$.
Definition 5.11. For any instance space $X$, concept space $C$ on $X$, sample-based cost function
$c$, and $(\epsilon, \delta) \in (0, 1)^2$, define the $(\epsilon, \delta)$-cost complexity, denoted $\mathrm{CostComplexity}(C, c, \epsilon, \delta)$, as
the infimum $t \ge 0$ such that $C$ is $(\epsilon, \delta)$-learnable given $c$ using total cost no greater than $t$.

As in the previous section, because it is the limiting case, we can equivalently define the
$(\epsilon, \delta)$-cost complexity as the infimum $t \ge 0$ such that there is an $(\epsilon, \delta)$-learning algorithm
guaranteed to have the sum of effective costs of the effective queries it makes at most $t$.

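The main results from this section include a new combinatorial quantity $\mathrm{GPIC}(C, c, m, \tau)$
such that if $d$ is the VC-dimension of $C$, then

$$\mathrm{GPIC}(C, c, \Theta(\tfrac{1}{\epsilon}), \delta) \le \mathrm{CostComplexity}(C, c, \epsilon, \delta) \le \mathrm{GPIC}(C, c, \tilde{\Theta}(\tfrac{d}{\epsilon}), 0)\, \tilde{O}(d).$$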


5.3.1  Related  Work 


Previous  work  on  pool-based  active  learning  in  the  PAC  model  has  been  restricted  almost 
exclusively  to  uniform-cost  membership  queries  on  examples  in  the  unlabeled  set  U.  There  has 
been  some  recent  progress  on  query  complexity  bounds  for  that  restricted  setting.  Specifically, 


Dasgupta [Dasgupta, 2004] analyzes a greedy active learning scheme and derives bounds for the
number of membership queries in $U$ it uses under an average case setting, in which the target
concept is selected randomly from a known distribution. A similar type of analysis was
previously given by Freund et al. [Freund et al., 1997] to prove positive results for the Query by
Committee algorithm. In a subsequent paper, Dasgupta [Dasgupta, 2005] derives upper and
lower bounds on the number of membership queries in $U$ required for active learning for any
particular distribution $\mathcal{D}$, under the assumption that $\mathcal{D}$ is known. The results I derive in this
section imply worst-case results (over both $\mathcal{D}$ and $f$) for this as a special case of more general
bounds applying to any sample-based cost function.


5.3.2  Cost  Complexity  Upper  Bounds 

I  now  derive  bounds  on  the  cost  complexity  of  pool-based  Active  PAC  Learning. 
Definition 5.12. For an instance space $X$, concept space $C$ on $X$, sample-based cost function $c$,
and nonnegative integer $m$, define the General Identification Cost Growth Function, denoted
$\mathrm{GIC}(C, c, m)$, as follows.

$$\mathrm{GIC}(C, c, m) = \sup_{U \in X^m} \mathrm{GIC}(C[U], c_U)$$


Definition 5.13. For any instance space $X$, concept space $C$ on $X$, and $(\epsilon, \delta) \in (0, 1)^2$, let
$M(C, \epsilon, \delta)$ denote the sample complexity of $C$ (in the classic passive learning sense), or the
smallest $m$ such that there is an algorithm $A$ taking as input a set of examples $\mathcal{L}$ and labels, and
outputting a classifier (without making any queries), such that for any $\mathcal{D}$ and $f \in C$,

$$\Pr_{\mathcal{L} \sim \mathcal{D}^m}\{\mathrm{error}_{\mathcal{D}}(A(\mathcal{L}, f(\mathcal{L})), f) > \epsilon\} \le \delta.$$


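It is known (e.g., [Anthony and Bartlett, 1999]) that

$$\max\Big\{\frac{d - 1}{32\epsilon},\ \frac{1}{2\epsilon} \ln\frac{1}{\delta}\Big\} \le M(C, \epsilon, \delta) \le \frac{4d}{\epsilon} \ln\frac{12}{\epsilon} + \frac{4}{\epsilon} \ln\frac{2}{\delta}$$

for $0 < \epsilon < 1/8$, $0 < \delta < .01$, and $d \ge 2$, where $d$ is the VC-dimension of $C$. Furthermore,
Warmuth has conjectured [Warmuth, 2004] that $M(C, \epsilon, \delta) = \Theta(\frac{1}{\epsilon}(d + \log\frac{1}{\delta}))$.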


With  these  definitions  in  mind,  we  have  the  following  novel  theorem. 


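Theorem 5.14. For any instance space $\mathcal{X}$, concept space $C$ on $\mathcal{X}$ with VC-dimension
$d \in (0, \infty)$, sample-based cost function $c$, $\epsilon \in (0, 1)$, and $\delta \in (0, \frac{1}{2})$, if $m = M(C, \epsilon, \delta)$, then

$$\mathrm{CostComplexity}(C, c, \epsilon, \delta) \le \mathrm{GIC}(C, c, m)\, d \log_2 \frac{em}{d}$$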


Proof. For the unlabeled sequence, sample $\mathcal{U} \sim \mathcal{D}^m$. If $\mathrm{GIC}(C, c, m) = \infty$, then the upper
bound holds vacuously, so let us assume this is finite. Also, $d \in (0, \infty)$ implies $|\mathcal{U}| \in (0, \infty)$
[Anthony and Bartlett, 1999]. By definition of $M(C, \epsilon, \delta)$, there exists a (passive learning)
algorithm $A$ such that $\forall f \in C, \forall \mathcal{D}$, $\Pr_{\mathcal{U} \sim \mathcal{D}^m}\{\mathrm{error}_{\mathcal{D}}(A(\mathcal{U}, f(\mathcal{U})), f) > \epsilon\} \le \delta$. Therefore any
algorithm that, by a finite sequence of effective queries with finite cost under $c_{\mathcal{U}}$, identifies $f(\mathcal{U})$
and then outputs $A(\mathcal{U}, f(\mathcal{U}))$, is an $(\epsilon, \delta)$-learning algorithm for $C$ using $c$.

Suppose now that there is a ghost teacher, who knows the teacher's target concept $f \in C$. The
ghost teacher uses the $h \in C[\mathcal{U}]$ with $h(\mathcal{U}) = f(\mathcal{U})$ as its target concept. In order to answer any
actual queries $q \in Q$ with $c_{\mathcal{U}}(q) < \infty$, the ghost teacher simply passes the query to the real
teacher and then answers the query using the real teacher's answer. This answer is guaranteed to
be valid because $c_{\mathcal{U}}$ is a sample-based cost function. Thus, identifying $f(\mathcal{U})$ can be
accomplished by identifying $h(\mathcal{U})$, which can be accomplished by identifying $h$. The task of
identifying $h$ can be reduced to an Exact Learning task with concept space $C[\mathcal{U}]$ and cost
function $c_{\mathcal{U}}$, where the teacher for the Exact Learning task is the ghost teacher. Therefore, by
the cost complexity upper bound for Exact Learning, the total cost required to identify $f(\mathcal{U})$ with
a finite sequence of queries is no greater than

$$\mathrm{CostComplexity}(C[\mathcal{U}], c_{\mathcal{U}}) \le \mathrm{GIC}(C[\mathcal{U}], c_{\mathcal{U}}) \log_2 |C[\mathcal{U}]| \le \mathrm{GIC}(C[\mathcal{U}], c_{\mathcal{U}})\, d \log_2 \frac{e|\mathcal{U}|}{d} \qquad (5.1)$$

where the last inequality is due to Sauer's Lemma (e.g., [Anthony and Bartlett, 1999]). Finally,
taking the worst case (supremum) over all $\mathcal{U} \in \mathcal{X}^m$ completes the proof. □


Note that (5.1) also implies a data-dependent bound, which could potentially be useful for
practical applications in which the unlabeled examples are available when bounding the cost. It
can also be used to state a distribution-dependent bound.
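As a rough numerical illustration of Theorem 5.14 and of the data-dependent form (5.1), the following sketch evaluates the right-hand side for given values of the identification cost, the sample size, and the VC-dimension. The function names are hypothetical and chosen only for this example; the value of GIC is assumed to be supplied by the caller, since computing it depends on the query set and cost function.

```python
import math


def sauer_exact(m, d):
    """Sauer's Lemma count: sum_{i=0}^{d} C(m, i), an upper bound on |C[U]| for |U| = m."""
    return sum(math.comb(m, i) for i in range(d + 1))


def sauer_log2_bound(m, d):
    """log2 of the relaxed bound (e*m/d)^d, valid for m >= d >= 1."""
    return d * math.log2(math.e * m / d)


def pac_cost_upper_bound(gic, m, d):
    """Right-hand side of Theorem 5.14: GIC * d * log2(e*m/d).

    Passing gic = GIC(C[U], c_U) and m = |U| gives the data-dependent form (5.1).
    """
    if math.isinf(gic):
        return math.inf  # the bound holds vacuously
    return gic * sauer_log2_bound(m, d)


if __name__ == "__main__":
    # Illustrative values only.
    print(pac_cost_upper_bound(gic=3.0, m=100, d=5))
```

The helper `sauer_exact` is included only to make the relaxation in the last inequality of (5.1) visible: the logarithm of the exact count is never larger than `sauer_log2_bound` when m is at least d.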


5.3.3  An  Example:  Intersection-Closed  Concept  Spaces 


As an example application, we can use the above theorem to prove new results for any
intersection-closed concept space,⁸ as follows.

⁸ An intersection-closed concept space $C$ has the property that for any $h_1, h_2 \in C$, there is a concept $h_3 \in C$
such that $\forall x \in \mathcal{X}$, $[h_1(x) = h_2(x) = 1 \Leftrightarrow h_3(x) = 1]$. For example, conjunctions and axis-aligned rectangles are
intersection-closed.
Lemma 5.15. For any instance space $\mathcal{X}$, intersection-closed concept space $C$ with
VC-dimension $d \ge 1$, sample-based cost function $c$ such that membership queries in $\mathcal{U}$ have
cost $\le \mu$ (i.e., $\forall \mathcal{U} \subseteq \mathcal{X}, x \in \mathcal{U}$, $c_{\mathcal{U}}(q_{\{x\}}) \le \mu$) and positive example queries in $\mathcal{U}$ have cost $\le \kappa$
(i.e., $\forall \mathcal{U} \subseteq \mathcal{X}, S \subseteq \mathcal{U}$, $c_{\mathcal{U}}(q_S) \le \kappa$), and integer $m \ge 0$,

$$\mathrm{GIC}(C, c, m) \le \kappa + \mu d$$

Proof. Say we have some set of unlabeled examples $\mathcal{U}$, and consider bounding the value of
$\mathrm{GIC}(C[\mathcal{U}], c_{\mathcal{U}})$. In the spy game, suppose the teacher is answering with effective oracle $T \in \mathcal{T}$.
Let $\mathcal{U}^+ = \{x \mid x \in \mathcal{U}, T(q_{\{x\}}) = \{h \mid h \in C^*, h(x) = 1\}\}$. The spy first tells the learner to make
the $q_{\mathcal{U} \setminus \mathcal{U}^+}$ query (if $\mathcal{U} \setminus \mathcal{U}^+ \ne \emptyset$). If $\exists x \in \mathcal{U} \setminus \mathcal{U}^+$ s.t. $T(q_{\mathcal{U} \setminus \mathcal{U}^+}) = \{h \mid h \in C^*, h(x) = 1\}$, then
the spy tells the learner to make effective query $q_{\{x\}}$ for this $x$, and there are no concepts in
$C[\mathcal{U}]$ consistent with the answers to these two queries; the total effective cost for this case is
$\kappa + \mu$. If this is not the case, but $|\mathcal{U}^+| = 0$, then there is at most one concept in $C[\mathcal{U}]$ consistent
with the answer to $q_{\mathcal{U} \setminus \mathcal{U}^+}$: namely, the $h \in C[\mathcal{U}]$ with $h(x) = 0$ for all $x \in \mathcal{U}$, if there is such an
$h$. In this case, the cost is just $\kappa$.

Otherwise, let $S$ be a largest subset of $\mathcal{U}^+$ such that $\exists h \in C$ with $\forall x \in S$, $h(x) = 1$. If $S = \emptyset$,
then making any membership query in $\mathcal{U}^+$ leaves all concepts in $C[\mathcal{U}]$ inconsistent (at cost $\mu$),
so let us assume $S \ne \emptyset$. For any $S \subseteq \mathcal{X}$, define

$$\mathrm{CLOS}(S) = \{x \mid x \in \mathcal{X}, \forall h \in C, [\forall y \in S, h(y) = 1] \Rightarrow h(x) = 1\},$$

the closure of $S$. Let $S'$ be a smallest subset of $S$ such that $\mathrm{CLOS}(S') = \mathrm{CLOS}(S)$, known as
a minimal spanning set of $S$ [Helmbold et al., 1990]. The spy now tells the learner to make
queries $q_{\{x\}}$ for all $x \in S'$.

Any concept in $C$ consistent with the answer to $q_{\mathcal{U} \setminus \mathcal{U}^+}$ must label every $x \in \mathcal{U} \setminus \mathcal{U}^+$ as 0. Any
concept in $C$ consistent with the answers to the membership queries on $S'$ must label every
$x \in \mathrm{CLOS}(S') = \mathrm{CLOS}(S) \supseteq S$ as 1. Additionally, every concept in $C$ that labels every
$x \in S$ as 1 must label every $x \in \mathcal{U}^+ \setminus S$ as 0, since $S$ is defined to be maximal. This labeling of
these three sets completely defines a labeling of $\mathcal{U}$, and as such there is at most one $h \in C[\mathcal{U}]$
consistent with the answers to all queries made by the learner. Helmbold, Sloan, and Warmuth
[Helmbold et al., 1990] proved that, for an intersection-closed concept space with
VC-dimension $d$, for any set $S$, all minimal spanning sets of $S$ have size at most $d$. This implies
the learner makes at most $d$ membership queries in $\mathcal{U}$, and thus has a total cost of at most
$\kappa + \mu d$. □
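The closure and minimal spanning set used in the proof can be computed by brute force for a small finite concept space. The sketch below is only an illustration under stated assumptions: concepts are represented as frozensets of the points they label 1, the instance space is a small finite range, and the function names `closure` and `minimal_spanning_set` are hypothetical. The example class is the set of intervals on {0, ..., 5} together with the empty concept, which is intersection-closed with VC-dimension 2.

```python
from itertools import combinations


def closure(S, concepts, X):
    """CLOS(S): points of X labeled 1 by every concept that labels all of S as 1."""
    consistent = [h for h in concepts if S <= h]
    return frozenset(x for x in X if all(x in h for h in consistent))


def minimal_spanning_set(S, concepts, X):
    """A smallest subset S' of S with CLOS(S') = CLOS(S), found by exhaustive search."""
    target = closure(S, concepts, X)
    for k in range(len(S) + 1):
        for subset in combinations(sorted(S), k):
            if closure(frozenset(subset), concepts, X) == target:
                return frozenset(subset)
    return frozenset(S)  # not reached: S always spans itself


# Illustrative concept space: intervals [a, b] on {0, ..., 5}, plus the empty concept.
X = range(6)
intervals = [frozenset(range(a, b + 1)) for a in X for b in X if a <= b] + [frozenset()]

if __name__ == "__main__":
    S = frozenset({1, 2, 4})
    print(sorted(closure(S, intervals, X)))
    print(sorted(minimal_spanning_set(S, intervals, X)))
```

For intervals, the spanning set consists of the two endpoints of S, in line with the size bound of d = 2 from [Helmbold et al., 1990]. The exhaustive search is exponential in |S| and is meant only for checking small cases.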


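Corollary 5.16. Under the conditions of Lemma 5.15, if $d \ge 10$, then for $0 < \epsilon < 1$ and
$0 < \delta < \frac{1}{2}$,

$$\mathrm{CostComplexity}(C, c, \epsilon, \delta) \le (\kappa + \mu d)\, d \log_2\left(\frac{e}{d} \max\left\{\frac{16d}{\epsilon} \ln d,\ \frac{6}{\epsilon} \ln \frac{28}{\delta}\right\}\right)$$

Proof. This follows from Theorem 5.14, Lemma 5.15, and Auer & Ortner's result
[Auer and Ortner, 2004] that for intersection-closed concept spaces with $d \ge 10$,
$M(C, \epsilon, \delta) \le \max\left\{\frac{16d}{\epsilon} \ln d,\ \frac{6}{\epsilon} \ln \frac{28}{\delta}\right\}$. □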


For example, consider the concept space of axis-parallel hyper-rectangles in $\mathcal{X} = \mathbb{R}^n$,
$C = \{h : \mathcal{X} \to \{0, 1\} \mid \exists((a_1, b_1), (a_2, b_2), \ldots, (a_n, b_n)) : \forall x \in \mathbb{R}^n, h(x) = 1 \Leftrightarrow \forall i \in \{1, 2, \ldots, n\}, a_i \le x_i \le b_i\}$.
One can show that this is an intersection-closed concept space
with VC-dimension $2n$. For a sample-based cost function $c$ of the form stated in Lemma 5.15,
we have that $\mathrm{CostComplexity}(C, c, \epsilon, \delta) \le \tilde{O}((\kappa + n\mu)n)$. Unlike the example in the previous
section, if all other query types have infinite cost, then for $n \ge 2$ there are distributions that
force any algorithm achieving this bound for small $\epsilon$ and $\delta$ to use multiple positive example
queries $q_S$ with $|S| > 1$. In particular, for finite constant $\kappa$, this is an exponential improvement
over the cost complexity of PAC active learning with only uniform cost membership queries on
$\mathcal{U}$.
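For axis-parallel rectangles, a spanning set of size at most 2n can be read off directly from the coordinate-wise extremes of S. The following sketch illustrates this under the assumption that points are given as equal-length tuples of numbers; the function names are hypothetical. It returns a spanning set within the 2n size bound, which is not guaranteed to be a smallest one.

```python
def rect_closure(S, U):
    """Points of U inside the smallest axis-parallel box containing S (empty if S is empty)."""
    if not S:
        return set()
    n = len(next(iter(S)))
    lo = [min(p[i] for p in S) for i in range(n)]
    hi = [max(p[i] for p in S) for i in range(n)]
    return {p for p in U if all(lo[i] <= p[i] <= hi[i] for i in range(n))}


def rect_spanning_set(S):
    """One minimizer and one maximizer per coordinate: at most 2n points spanning S."""
    if not S:
        return set()
    n = len(next(iter(S)))
    span = set()
    for i in range(n):
        span.add(min(S, key=lambda p: p[i]))
        span.add(max(S, key=lambda p: p[i]))
    return span


if __name__ == "__main__":
    U = [(0, 0), (1, 3), (2, 1), (3, 2), (5, 5), (1, 1)]
    S = {(0, 0), (1, 3), (2, 1), (3, 2), (1, 1)}
    span = rect_spanning_set(S)
    print(sorted(span))
    print(rect_closure(span, U) == rect_closure(S, U))
```

In the spy strategy of Lemma 5.15, membership queries on such a spanning set certify that every point of the bounding box is positive, so at most 2n membership queries are needed in addition to one positive example query.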
5.3.4 A Cost Complexity Lower Bound


At first glance, it might seem that $\mathrm{GIC}(C, c, \lceil \frac{1-\epsilon}{\epsilon} \rceil)$ could be a lower bound on
$\mathrm{CostComplexity}(C, c, \epsilon, \delta)$. In fact, one can show this is true for $\delta < \left(\frac{\epsilon d}{e}\right)^d$. However, there are
simple examples for which this is not a lower bound for general $\epsilon$ and $\delta$.⁹ We therefore require a
slight modification of GIC to introduce dependence on $\delta$.

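Definition 5.17. For an instance space $\mathcal{X}$, finite concept space $C$ on $\mathcal{X}$, cost function $c$, and
$\delta \in [0, 1)$, define the General Partial Identification Cost, denoted $\mathrm{GPIC}(C, c, \delta)$, as follows.

$$\mathrm{GPIC}(C, c, \delta) = \inf\left\{t \;\middle|\; t \ge 0, \forall T \in \mathcal{T}, \exists R \subseteq Q, \text{ s.t. } \left[\sum_{q \in R} c(q) \le t\right] \wedge \left[|C \cap T(R)| \le \delta|C| + 1\right]\right\}$$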

Definition 5.18. For an instance space $\mathcal{X}$, concept space $C$ on $\mathcal{X}$, sample-based cost function
$c$, non-negative integer $m$, and $\delta \in [0, 1)$, define the General Partial Identification Cost Growth
Function, denoted $\mathrm{GPIC}(C, c, m, \delta)$, as follows.

$$\mathrm{GPIC}(C, c, m, \delta) = \sup_{\mathcal{U} \in \mathcal{X}^m} \mathrm{GPIC}(C[\mathcal{U}], c_{\mathcal{U}}, \delta)$$

It is easy to see that $\mathrm{GIC}(C, c) = \mathrm{GPIC}(C, c, 0)$ and $\mathrm{GIC}(C, c, m) = \mathrm{GPIC}(C, c, m, 0)$, so
that all of the above results could be stated in terms of GPIC.
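To make the definition of GPIC concrete, the following brute-force sketch evaluates it for a tiny finite concept space when the only finite-cost queries are unit-cost membership queries. This is an illustration under assumptions that go beyond the text above: each effective oracle is modeled as an arbitrary 0/1 labeling of the points (not necessarily a concept in C), concepts are distinct and represented as dictionaries from points to labels, and the function names are hypothetical.

```python
from itertools import combinations, product


def gpic_membership(concepts, points, delta=0.0):
    """Brute-force GPIC for unit-cost membership queries on `points`.

    For each oracle labeling, find the fewest queried points R such that the
    number of concepts agreeing with the answers on R is at most
    delta * |C| + 1, then take the worst case over labelings.
    Assumes the concepts are pairwise distinct on `points`.
    """
    threshold = delta * len(concepts) + 1
    worst = 0
    for answers in product((0, 1), repeat=len(points)):
        label = dict(zip(points, answers))
        best = None
        for k in range(len(points) + 1):
            for R in combinations(points, k):
                survivors = [h for h in concepts if all(h[x] == label[x] for x in R)]
                if len(survivors) <= threshold:
                    best = k
                    break
            if best is not None:
                break
        worst = max(worst, best)
    return worst


def singletons(n):
    """Concepts h_x(y) = I[x = y] on the points {0, ..., n-1}."""
    points = list(range(n))
    return [{y: int(x == y) for y in points} for x in points], points


if __name__ == "__main__":
    concepts, points = singletons(4)
    print(gpic_membership(concepts, points, delta=0.0))
    print(gpic_membership(concepts, points, delta=0.5))
```

With delta = 0 this reduces to GIC. For the singleton class on N points the worst case is the all-zero labeling, which requires N - 1 queries; allowing delta > 0 lets the learner stop as soon as few enough concepts remain.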

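Theorem 5.19. For any instance space $\mathcal{X}$, concept space $C$ on $\mathcal{X}$, sample-based cost function
$c$, $(\epsilon, \delta) \in (0, 1)^2$, and any $V \subseteq C$,

$$\mathrm{GPIC}\left(V, c, \left\lceil \frac{1-\epsilon}{\epsilon} \right\rceil, \delta\right) \le \mathrm{CostComplexity}(C, c, \epsilon, \delta)$$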

Proof. Let $S \subseteq \mathcal{X}$ be a set with $1 \le |S| \le \lceil \frac{1-\epsilon}{\epsilon} \rceil$, and let $\mathcal{D}_S$ be the uniform distribution on $S$.
Thus, $\mathrm{error}_{\mathcal{D}_S}(h, f) \le \epsilon \Leftrightarrow h(S) = f(S)$. I will show that any algorithm $A$ guaranteeing
$\Pr_{\mathcal{U} \sim \mathcal{D}_S^m}\{\mathrm{error}_{\mathcal{D}_S}(A(\mathcal{U}), f) > \epsilon\} \le \delta$ cannot also guarantee cost strictly less than
$\mathrm{GPIC}(V[S], c_S, \delta)$. If $\delta|V[S]| \ge |V[S]| - 1$, the result is clear since no algorithm guarantees
cost less than 0, so assume $\delta|V[S]| < |V[S]| - 1$. Suppose $A$ is an algorithm that guarantees,
for every finite sequence $\mathcal{U}$ of elements from $S$, $A(\mathcal{U})$ incurs total cost strictly less than
$\mathrm{GPIC}(V[S], c_S, \delta)$ under $c_{\mathcal{U}}$ (and therefore also under $c_S$). By definition of GPIC, $\exists T \in \mathcal{T}$
such that for any set of queries $R$ that $A(\mathcal{U})$ makes, $|V[S] \cap T(R)| > \delta|V[S]| + 1$. I now
proceed by the probabilistic method. Say the teacher draws the target concept $f$ uniformly at
random from $V[S]$, and $\forall q \in Q$ s.t. $f \in T(q)$, answers with $T(q)$. Any $q \in Q$ such that
$f \notin T(q)$ can be answered with an arbitrary $a \in q(f)$. Let $h_{\mathcal{U}} = A(\mathcal{U})$; let $R_{\mathcal{U}}$ denote the set of
queries $A(\mathcal{U})$ would make if all queries were answered with $T$.

$$\begin{aligned}
\mathbb{E}_f\left[\Pr_{\mathcal{U} \sim \mathcal{D}_S^m}\{\mathrm{error}_{\mathcal{D}_S}(A(\mathcal{U}), f) > \epsilon\}\right]
&= \mathbb{E}_{\mathcal{U} \sim \mathcal{D}_S^m}\left[\Pr_f\{h_{\mathcal{U}}(S) \ne f(S)\}\right] \\
&\ge \mathbb{E}_{\mathcal{U} \sim \mathcal{D}_S^m}\left[\Pr_f\{h_{\mathcal{U}}(S) \ne f(S) \wedge f \in T(R_{\mathcal{U}})\}\right] \\
&\ge \min_{\mathcal{U} \in S^m} \frac{|V[S] \cap T(R_{\mathcal{U}})| - 1}{|V[S]|} > \delta
\end{aligned}$$

Therefore, there exists a deterministic method for selecting $f$ and answering queries such that
$\Pr_{\mathcal{U} \sim \mathcal{D}_S^m}\{\mathrm{error}_{\mathcal{D}_S}(A(\mathcal{U}), f) > \epsilon\} > \delta$. In particular, this proves that there are no $(\epsilon, \delta)$-learning
algorithms that guarantee cost strictly less than $\mathrm{GPIC}(V[S], c_S, \delta)$. Taking the supremum over
sets $S$ completes the proof. □

⁹ The infamous "Monty Hall" problem is an interesting example of this. For another example, consider $\mathcal{X} =
\{1, 2, \ldots, N\}$, $C = \{h_x \mid x \in \mathcal{X}, \forall y \in \mathcal{X}, h_x(y) = I[x = y]\}$, and cost that is 1 for membership queries in $\mathcal{U}$ and
infinite for other queries. Although $\mathrm{GIC}(C, c, N) = N - 1$, it is possible to achieve better than $\epsilon =$ […] with
probability close to […] using cost no greater than $N - 2$.
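The choice of $|S| \le \lceil (1-\epsilon)/\epsilon \rceil$ in the proof of Theorem 5.19 is what makes error at most $\epsilon$ under the uniform distribution on S equivalent to agreeing on all of S: a single disagreement costs error $1/|S|$, which exceeds $\epsilon$. The sketch below checks this arithmetic exactly with rational numbers; the function names are hypothetical.

```python
import math
from fractions import Fraction


def hard_set_size(eps):
    """Largest |S| used in the proof of Theorem 5.19: ceil((1 - eps) / eps)."""
    eps = Fraction(eps)
    return math.ceil((1 - eps) / eps)


def single_mistake_exceeds_eps(eps):
    """Under the uniform distribution on S, one mislabeled point has error 1/|S|.

    Returns True when that error is strictly greater than eps for the largest
    allowed |S|, so that error <= eps forces h(S) = f(S).
    """
    eps = Fraction(eps)
    return Fraction(1, hard_set_size(eps)) > eps


if __name__ == "__main__":
    for eps in (Fraction(1, 10), Fraction(3, 10), Fraction(1, 2)):
        print(eps, hard_set_size(eps), single_mistake_exceeds_eps(eps))
```

Exact fractions are used instead of floating point so that boundary cases such as $\epsilon = 1/2$ are not affected by rounding.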


Corollary 5.20. Under the conditions of Theorem 5.19,

$$\mathrm{GPIC}\left(C, c, \left\lceil \frac{1-\epsilon}{\epsilon} \right\rceil, \delta\right) \le \mathrm{CostComplexity}(C, c, \epsilon, \delta).$$

Equipped with Theorem 5.19, it is straightforward to prove the claim made in Section 5.3.3 that
there are distributions forcing any $(\epsilon, \delta)$-learning algorithm for axis-parallel rectangles using
only membership queries (at cost $\mu$) to pay $\Omega\left(\frac{\mu(1-\delta)}{\epsilon}\right)$. The details are left as an exercise.


5.4  Discussion  and  Open  Problems 

Note that the usual "query counting" analysis done for Active Learning is a special case of cost
complexity (uniform cost 1 on the allowed queries, infinite cost on the others). In particular,
Theorem 5.14 can easily be specialized to give a worst-case bound on the query complexity for
the widely studied setting in which the learner can make any membership queries on examples
in $\mathcal{U}$ [Dasgupta, 2005]. However, for this special case, one can derive a slightly tighter bound.
Following the proof technique of Hegedüs [Hegedüs, 1995], one can show that for any


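sample-based cost function $c$ such that $\forall \mathcal{U} \subseteq \mathcal{X}, q \in Q$,
$c_{\mathcal{U}}(q) < \infty \Rightarrow [c_{\mathcal{U}}(q) = 1 \wedge \forall f \in C^*, |q(f)| = 1]$,

$$\mathrm{CostComplexity}(C, c_{\mathcal{X}}) \le 2\,\frac{\mathrm{GIC}(C, c_{\mathcal{X}}) \log_2 |C|}{\log_2 \mathrm{GIC}(C, c_{\mathcal{X}})}.$$

This implies for the PAC setting that

$$\mathrm{CostComplexity}(C, c, \epsilon, \delta) \le 2\,\frac{\mathrm{GIC}(C, c, m)\, d \log_2 m}{\log_2 \mathrm{GIC}(C, c, m)},$$

for VC-dimension $d \ge 3$ and $m = M(C, \epsilon, \delta)$. This includes the cost function assigning 1 to
membership queries on $\mathcal{U}$ and $\infty$ to all others.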
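The two PAC bounds can be compared numerically. The sketch below evaluates the general bound of Theorem 5.14 and the membership-query bound stated above for user-supplied values of GIC, d, and m; the function names and the sample values are hypothetical and serve only as an illustration.

```python
import math


def general_bound(gic, d, m):
    """Theorem 5.14: GIC * d * log2(e*m/d)."""
    return gic * d * math.log2(math.e * m / d)


def membership_bound(gic, d, m):
    """Bound for unit-cost, single-answer queries: 2 * GIC * d * log2(m) / log2(GIC).

    Stated in the text for d >= 3; the formula needs GIC > 1 so that the
    denominator is positive.
    """
    if gic <= 1:
        raise ValueError("requires GIC > 1")
    return 2 * gic * d * math.log2(m) / math.log2(gic)


if __name__ == "__main__":
    gic, d, m = 16, 5, 1000  # illustrative values only
    print(general_bound(gic, d, m))
    print(membership_bound(gic, d, m))
```

For these sample values the second bound is smaller. The gain comes from the $\log_2 \mathrm{GIC}$ factor in the denominator, so it shows up when GIC is reasonably large and can vanish when GIC is very small.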

Active Learning in the PAC model is closely related to the topic of Semi-Supervised Learning.
Balcan & Blum [Balcan and Blum, 2005] have recently derived a variety of sample complexity
bounds for Semi-Supervised Learning. Many of the techniques can be transferred to the
pool-based Active Learning setting in a fairly natural way. Specifically, suppose there is a
quantitative notion of "compatibility" between a concept and a distribution, which can be
estimated from a finite unlabeled sample. If we know the target concept is highly compatible
with the data distribution, we can draw enough unlabeled examples to estimate compatibility,
then identify and discard those concepts that are probably highly incompatible. The set of
highly compatible concepts may be significantly less expressive, therefore reducing both the
number of examples for which an algorithm must learn the labels to guarantee generalization
and the number of labelings of those examples the algorithm must distinguish between, thereby
also reducing the cost complexity.

There are a variety of interesting extensions of this framework worth pursuing. Perhaps the
most natural direction is to move into the agnostic PAC framework, which has thus far been
quite elusive for active learning except for a few results [Balcan et al., 2006, Kääriäinen, 2005].
Another possibility is to derive cost complexity bounds when the cost $c$ is a function of not only
the query, but also the target concept. Then every time the learning algorithm makes a query $q$,
it is charged $c(q, f)$, but does not necessarily know what this value is. However, it can always
upper bound the total cost so far by the worst case over concepts in the version space. Can
anything interesting be said about this setting (or variants), perhaps under some benign
smoothness constraints on $c(q, \cdot)$? This is of some practical importance since, for example, it is
often more difficult to label examples that occur near a decision boundary.
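The worst-case accounting described above can be sketched for membership queries as follows. Everything in this example is a hypothetical illustration: the concept space is a set of threshold classifiers on integer points, each identified by its threshold t, and the cost function `cost` is an invented example that charges more for points near the decision boundary of the target.

```python
def label(t, x):
    """Threshold concept with boundary t: labels x as 1 when x >= t."""
    return int(x >= t)


def cost(x, t):
    """Hypothetical target-dependent cost: larger when x is close to the boundary t."""
    return 1.0 + 1.0 / (1 + abs(x - t))


def version_space(thresholds, answered):
    """Thresholds consistent with the (point, label) answers seen so far."""
    return [t for t in thresholds if all(label(t, x) == y for x, y in answered)]


def worst_case_cost(thresholds, answered):
    """Upper bound on the total charge so far: the largest summed cost over
    concepts still in the version space. Assumes the answers are realizable,
    so the version space is non-empty."""
    vs = version_space(thresholds, answered)
    return max(sum(cost(x, t) for x, _ in answered) for t in vs)


if __name__ == "__main__":
    thresholds = range(6)
    answered = [(1, 0), (4, 1)]  # point 1 labeled 0, point 4 labeled 1
    print(version_space(thresholds, answered))
    print(worst_case_cost(thresholds, answered))
```

The learner never observes the true charges, but since the target is always in the version space, this maximum is a valid upper bound on what it has been charged, and it can only be refined as further answers shrink the version space.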